Abstract

We present global convergence rates for a line-search method which is based on random 
first-order models and directions whose quality is ensured only with certain probability. We 
show that in terms of the order of the accuracy, the evaluation complexity of such a method is 
the same as its counterparts that use deterministic accurate models; the use of probabilistic 
models only increases the complexity by a constant, which depends on the probability of the 
models being good. We particularize and improve these results in the convex and strongly 
convex case. 

We also analyze a probabilistic cubic regularization variant that allows approximate
probabilistic second-order models and show improved complexity bounds compared to
probabilistic first-order methods; again, as a function of the accuracy, the probabilistic cubic
regularization bounds are of the same (optimal) order as for the deterministic case.


Keywords: line-search methods, cubic regularization methods, random models, global convergence
analysis.


1 Introduction 

We consider in this paper the unconstrained optimization problem 

$$\min_{x} f(x),$$

where the first (and second, when specified) derivatives of the objective function $f(x)$ are
assumed to exist and be (globally) Lipschitz continuous.

Most unconstrained optimization methods rely on approximate local information to compute
a local descent step in such a way that sufficient decrease of the objective function is achieved.
To ensure such sufficient decrease, the step has to satisfy certain requirements. Often in practical
applications ensuring these requirements for each step is prohibitively expensive or impossible.
This may be because derivative information about the objective function is not available,
because the full gradient (and Hessian) is too expensive to compute, or because a model of the
objective function is too expensive to optimize accurately.

Recently, there has been a significant increase in interest in unconstrained optimization
methods with inexact information. Some of these methods consider the case when gradient 
information is inaccurate. This error in the gradient computation may simply be bounded 
in the worst case (deterministically), see, for example, [11, 20], or the error is random and
the estimated gradient is accurate in expectation, as in stochastic gradient algorithms, see for 
example, [121 Ea i23i m]. These methods are typically applied in a convex setting and do not 
extend to nonconvex cases. Complexity bounds are derived that bound the expected accuracy
that is achieved after a given number of iterations. 

In the nonlinear optimization setting, the complexity of various unconstrained methods has 
been derived under exact derivative information laiHiin!, and also under inexact information, 
where the errors are bounded in a deterministic fashion |31El[IIl[Ill[2n]. In all the cases of the 
deterministic inexact setting, traditional optimization algorithms such as line search, trust region 
or adaptive regularization algorithms are applied with little modification and work in practice 
as well as in theory, while the error is assumed to be bounded in some decaying manner at 
each iteration. In contrast, the methods based on stochastic estimates of the derivatives do not
assume deterministically bounded errors; however, they are quite different from the "traditional"
methods in their strategy for step size selection and averaging of the iterates. In other words,
they are not simple counterparts of the deterministic methods.

Our purpose in this paper is to derive a class of methods which inherit the best properties 
of traditional deterministic algorithms, and yet relax the assumption that the derivative/model 
error is bounded in a deterministic manner. Moreover, we do not assume that the error is zero 
in expectation or that it has a bounded variance. Our results apply in the setting where at each 
iteration, with sufficiently high probability, the error is bounded in a decaying manner, while in 
the remaining cases, this error can be arbitrarily large. In this paper, we assume that the error 
may happen in the computation of the derivatives and search directions, but that there is no 
error in the function evaluations, when success of an iterate has to be validated. 

Recently several methods for unconstrained black-box optimization have been proposed, 
which rely on random models or directions mini [IS], but are applied to deterministic functions. 
In this paper we take this line of work one step further by establishing expected convergence 
rates for several schemes based on one generic analytical framework. 

We consider four cases and derive four different complexity bounds. In particular, we analyze 
a line search method based on random models, for the cases of general nonconvex, convex and
strongly convex functions. We also analyze a second order method - an adaptive regularization 
method with cubics mu - which is known to achieve the optimal convergence rate for the 
nonconvex smooth functions [3] and we show that the same convergence rate holds in expectation.

In summary, our results differ from existing literature using inexact, stochastic or random 
information in the following main points: 

• Our models are assumed to be "good" with some probability, but there are no other
assumptions on the expected values or variance of the model parameters.

• The methods that we analyze are essentially the exact counterparts of the deterministic
methods, and do not require averaging of the iterates or any other significant changes.
We believe that, amongst other things, our analysis helps to understand the convergence
properties of practical algorithms that do not always seek to ensure theoretically required
model quality.

• Our main convergence rate results provide a bound on the expected number of iterations 
that the algorithms take before they achieve a desired level of accuracy. This is in contrast 
to a typical analysis of randomized or stochastic methods, where what is bounded is the 
expected error after a given number of iterations. Both bounds are useful, but we believe 
that the bound on the expected number of steps is a somewhat more meaningful complexity 
bound in our setting. The only other work that we are aware of which provides bounds in 
terms of the number of required steps is m where probabilistic bounds are derived in the 
particular context of random direct search with possible extension to trust region methods 
as discussed in Section 6 of m- 

An additional goal of this paper is to present a general theoretical framework, which could 
be used to analyze the behavior of other algorithms, and different possible model construction 
mechanisms under the assumption that the objective function is deterministic. We propose a 
general analysis of an optimization scheme by reducing it to the analysis of a stochastic process. 
Convergence results for a trust region method in [T] also rely on a stochastic process analysis, 
but only in terms of behavior in the limit. These results have now been extended to noisy 
(stochastic) functions, see OdO]. Deriving convergence rates for methods applied to stochastic 
functions is the subject of future work and is likely to depend on the results in this paper. 

The rest of the paper is organized as follows. In Section 2 we describe the general scheme
which encompasses several unconstrained optimization methods. This scheme is based on using
random models, which are assumed to satisfy some "quality" conditions with probability at
least $p$, conditioned on the past. Applying this optimization scheme results in a stochastic
process, whose behavior is analyzed in the later parts of Section 2. Analysis of the stochastic
process allows us to bound the expected number of steps of our generic scheme until a desired
accuracy is reached. In Section 3 we analyze a linesearch algorithm based on random models
and show how its behavior fits into our general framework for the cases of nonconvex, convex and
strongly convex functions. In Section 4 we apply our generic analysis to the case of the Adaptive
Regularization method with Cubics (ARC). Finally, in Section 5 we describe different settings
where the models of the objective functions satisfy the probabilistic conditions of our schemes.

2 A general optimization scheme with random models 

This section presents the main features of our algorithms and analysis, in a general framework 
that we will, in subsequent sections, particularize to specific algorithms (such as linesearch and 
cubic regularization) and classes of functions (convex, nonconvex). The reason for the initial
generic approach is to avoid repetition of the common elements of the analysis for the different
algorithms and to emphasize the key ingredients of our analysis, which is possibly applicable to
other algorithms (provided they satisfy our framework).

2.1 A general optimization scheme

We first describe a generic algorithmic framework that encompasses the main components of 
the unconstrained optimization schemes we analyze in this paper. The scheme relies on building 
a model of the objective function at each iteration, minimizing this model or reducing it in 
a sufficient manner and considering the step which is dependent on a stepsize parameter and 
which provides the model reduction (the stepsize parameter may be present in the model or 
independent of it). This step determines a new candidate point. The function value is then 
computed (accurately) at the new candidate point. If the function reduction provided by the 
candidate point is deemed sufficient, then the iteration is declared successful, the candidate 
point becomes the new iterate and the step size parameter is increased. Otherwise, the iteration 
is unsuccessful, the iterate is not updated and the step size parameter is reduced. 

We summarize the main steps of the scheme below. 

Algorithm 2.1 Generic optimization framework based on random models

Initialization

Choose a class of (possibly random) models $m_k(x)$, choose constants $\gamma \in (0,1)$, $\theta \in (0,1)$,
$\alpha_{\max} > 0$. Initialize the algorithm by choosing $x^0$, $m_0(x)$, $0 < \alpha_0 < \alpha_{\max}$.

1. Compute a model and a step

Compute a local (possibly random) model $m_k(x)$ of $f$ around $x^k$.

Compute a step $s^k(\alpha_k)$ which reduces $m_k(x)$, where the parameter $\alpha_k > 0$ is present in the
model or in the step calculation.

2. Check sufficient decrease

Compute $f(x^k + s^k(\alpha_k))$ and check if sufficient reduction (parametrized by $\theta$) is achieved
in $f$ with respect to $m_k(x^k) - m_k(x^k + s^k(\alpha_k))$.

3. Successful step

If sufficient reduction is achieved then $x^{k+1} := x^k + s^k(\alpha_k)$, set $\alpha_{k+1} = \min\{\alpha_{\max}, \gamma^{-1}\alpha_k\}$.
Let $k := k + 1$.

4. Unsuccessful step

Otherwise, $x^{k+1} := x^k$, set $\alpha_{k+1} = \gamma\alpha_k$. Let $k := k + 1$.
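
The loop structure of Algorithm 2.1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the callback `model_and_step` (which returns the step and the model reduction), the test function `f`, and all numerical values are hypothetical choices made for the example, and the sufficient decrease test is written in the ratio form "actual reduction at least $\theta$ times model reduction".

```python
import numpy as np

def generic_scheme(f, model_and_step, x0, alpha0, alpha_max,
                   gamma=0.5, theta=0.1, max_iter=50):
    """Sketch of the generic framework. `model_and_step(x, alpha)` is a
    hypothetical callback returning (s, pred), where s plays the role of
    s^k(alpha_k) and pred = m_k(x^k) - m_k(x^k + s^k(alpha_k))."""
    x, alpha = np.asarray(x0, dtype=float), float(alpha0)
    successes = []
    for _ in range(max_iter):
        s, pred = model_and_step(x, alpha)        # Step 1: model and step
        actual = f(x) - f(x + s)                  # Step 2: exact function values
        ok = pred > 0 and actual >= theta * pred  # sufficient decrease test
        if ok:                                    # Step 3: successful step
            x, alpha = x + s, min(alpha_max, alpha / gamma)
        else:                                     # Step 4: unsuccessful step
            alpha = gamma * alpha
        successes.append(ok)
    return x, alpha, successes

# Hypothetical test problem: an ill-conditioned convex quadratic.
def f(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def exact_linear_model(x, alpha):
    g = np.array([x[0], 10.0 * x[1]])   # exact gradient, so the model is always accurate
    s = -alpha * g                      # steepest-descent step s = alpha * d with d = -g
    return s, float(-(s @ g))           # reduction of the linear model: alpha * ||g||^2

x_final, alpha_final, successes = generic_scheme(
    f, exact_linear_model, x0=[3.0, -4.0], alpha0=1.0, alpha_max=2.0)
```

With an always-accurate model, as here, the sketch behaves like a standard backtracking line search in which the step size is also allowed to grow again after a success.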


Let us illustrate how the above scheme relates to standard optimization methods. In line-search
methods, one minimizes a linear model $m_k(x) = f(x^k) + (x - x^k)^T g^k$ (subject to some
normalization), or a quadratic one $m_k(x) = f(x^k) + (x - x^k)^T g^k + \frac{1}{2}(x - x^k)^T b^k (x - x^k)$ (when
the latter is well-defined, with $b^k$ a Hessian approximation matrix), to find directions $d^k = -g^k$
or $d^k = -(b^k)^{-1} g^k$, respectively. Then the step is defined as $s^k(\alpha_k) = \alpha_k d^k$ for some $\alpha_k$ and,
commonly, the (Armijo) decrease condition is checked,

$$f(x^k) - f(x^k + s^k(\alpha_k)) \ge -\theta s^k(\alpha_k)^T g^k,$$

where $-\theta s^k(\alpha_k)^T g^k$ is a multiple of $m_k(x^k) - m_k(x^k + s^k(\alpha_k))$. Note that if the model stays the
same, in that $m_k(x) = m_{k-1}(x)$ for each $k$ such that the $(k-1)$st iteration is unsuccessful, then the
above framework essentially reduces to a standard deterministic linesearch.

In the case of cubic regularization methods, $s^k(\alpha_k)$ is computed to approximately minimize
a cubic model $m_k(x) = f(x^k) + (x - x^k)^T g^k + \frac{1}{2}(x - x^k)^T b^k (x - x^k) + \frac{1}{3\alpha_k}\|x - x^k\|^3$ and the
sufficient decrease condition is

$$\frac{f(x^k) - f(x^k + s^k(\alpha_k))}{m_k(x^k) - m_k(x^k + s^k(\alpha_k))} \ge \theta.$$

Note that here as well, in the deterministic case, $g^k = g^{k-1}$ and $b^k = b^{k-1}$ for each $k$ such that
the $(k-1)$st iteration is unsuccessful but $\alpha_k \ne \alpha_{k-1}$.

The key assumption in the usual deterministic case is that the models $m_k(x)$ are sufficiently
accurate in a small neighborhood of the current iterate $x^k$. The goal of this paper is to relax
this requirement and allow the use of random local models which are accurate only with a certain
probability (conditioned on the past). In that case, note that the models need to be re-drawn
after each iteration, whether successful or not.

Note that our general setting includes the cases when the model (the derivative information,
for example) is always accurate, but the step is computed approximately, in a probabilistic
manner. For example, $s^k$ can be an approximation of $-(b^k)^{-1} g^k$. It is easy to see how
randomness in the calculation of $s^k$ can be viewed as the randomness in the model, by considering that
instead of the accurate model

$$f(x^k) + (x - x^k)^T g^k + \frac{1}{2}(x - x^k)^T b^k (x - x^k)$$

we use an approximate model

$$m_k(x) = f(x^k) - (x - x^k)^T b^k s^k + \frac{1}{2}(x - x^k)^T b^k (x - x^k).$$

Hence, as long as the accuracy requirements are carried over accordingly, the approximate random
models subsume the case of approximate random step computations. The next section makes
precise our requirements on the probabilistic models.
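
The equivalence between an inexact step and an inexact model can be checked numerically. The sketch below is an illustration under assumptions: the matrix `b`, the vector `g`, and the size of the perturbation applied to the exact step are arbitrary, hypothetical choices. It verifies that the approximate model above has gradient $-b^k s^k$ at $x^k$ and that its minimizer is exactly the inexact step $s^k$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
b = A @ A.T + n * np.eye(n)           # symmetric positive definite stand-in for b^k
g = rng.normal(size=n)                # stand-in for the accurate gradient g^k

s_exact = -np.linalg.solve(b, g)                  # exact step -(b^k)^{-1} g^k
s_approx = s_exact + 0.05 * rng.normal(size=n)    # hypothetical inexact step s^k

# The approximate model replaces g^k by -b^k s^k in the linear term.
g_model = -b @ s_approx

# Minimizing the approximate quadratic model gives the step d solving b d = -g_model.
d = -np.linalg.solve(b, g_model)

# Error in the model gradient induced by the error in the step.
gradient_error = g_model - g
```

In this reading, an error $s^k - s^k_{\text{exact}}$ in the step is the same as a gradient error $-b^k (s^k - s^k_{\text{exact}})$ in the model, which is why accuracy requirements on steps can be restated as accuracy requirements on models.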

2.2 Generic probabilistic models 

We will now introduce the key probabilistic ingredients of our scheme. In particular we assume
that our models $m_k$ are random and that they satisfy some notion of good quality with some
probability $p$. We will consider random models $M_k$, and then use the notation $m_k = M_k(\omega_k)$ for
their realizations. The randomness of the models will imply the randomness of the points $x^k$,
the step length parameter $\alpha_k$, the computed steps $s^k$ and other quantities produced by the
algorithm. Thus, in our paper, these random variables will be denoted by $X^k$, $\mathcal{A}_k$, $S^k$ and so
on, respectively, while $x^k = X^k(\omega_k)$, $\alpha_k = \mathcal{A}_k(\omega_k)$, $s^k = S^k(\omega_k)$, etc., denote their realizations
(we will omit the $\omega_k$ in the notation for brevity).

For each specific optimization method, we will define a notion of sufficiently accurate models.
The desired accuracy of the model depends on the current iterate $x^k$, step parameter $\alpha_k$ and,
possibly, the step $s^k(\alpha_k)$. This notion involves model properties which make sufficient decrease
in $f$ achievable by the step $s^k(\alpha_k)$. Specific conditions on the models will be stated for each
algorithm in the respective sections and how these conditions may be achieved will be discussed
in Section 5.

Definition 2.1 [sufficiently accurate models; true and false iterations] We say that a
sequence of random models $\{M_k\}$ is ($p$)-probabilistically "sufficiently accurate" for a
corresponding sequence $\{\mathcal{A}_k, X^k\}$ if the following indicator random variable

$$I_k = \mathbb{1}\{M_k \text{ is a sufficiently accurate model of } f \text{ for the given } X^k \text{ and } \mathcal{A}_k\}$$

satisfies the following submartingale-like condition

$$P(I_k = 1 \mid \mathcal{F}^M_{k-1}) \ge p, \tag{1}$$

where $\mathcal{F}^M_{k-1} = \sigma(M_0, \ldots, M_{k-1})$ is the $\sigma$-algebra generated by $M_0, \ldots, M_{k-1}$; in other words,
the history of the algorithm up to iteration $k$.

We say that iteration $k$ is a true iteration if the event $I_k = 1$ occurs. Otherwise the iteration
is called false.

Note that $M_k$ is a random model that, given the past history, encompasses all the randomness
of iteration $k$ of our algorithm. The iterates $X^k$ and the step length parameter $\mathcal{A}_k$ are random
variables defined over the $\sigma$-algebra generated by $M_0, \ldots, M_{k-1}$. Each $M_k$ depends on $X^k$ and
$\mathcal{A}_k$ and hence on $M_0, \ldots, M_{k-1}$. Definition 2.1 serves to enforce the following property: even
though the accuracy of $M_k$ may be dependent on the history, $(M_0, \ldots, M_{k-1})$, via its dependence
on $X^k$ and $\mathcal{A}_k$, it is sufficiently good with probability at least $p$, regardless of that history. This
condition is more reasonable than complete independence of $M_k$ from the past, which is difficult
to ensure. It is important to note that, from this assumption, it follows that whether or not the
step is deemed successful and the iterate $x^k$ is updated, our scheme always updates the model
$m_k$, unless $m_k$ is somehow known to be sufficiently accurate for $x^{k+1}$ and $\alpha_{k+1}$. We will
discuss this in more detail in Section 5.

When Algorithm 2.1 is based on probabilistic models (and all its specific variants under
consideration), it results in a discrete time stochastic process. This stochastic process encompasses
random elements such as $\mathcal{A}_k$, $X^k$, $S^k$, which are directly computed by the algorithm, but also
some quantities that can be derived as functions of $\mathcal{A}_k$, $X^k$, $S^k$, such as $f(X^k)$, $\|\nabla f(X^k)\|$ and
a quantity $F_k$, which we will use to denote some measure of progress towards optimality. Each
realization of the sequence of random models results in a realization of the algorithm, which in
turn produces the corresponding sequences $\{\alpha_k\}$, $\{x^k\}$, $\{s^k\}$, $\{f(x^k)\}$, $\{\|\nabla f(x^k)\|\}$ and $\{f_k\}$ (see footnote 1).
We will analyze the stochastic processes restricting our attention to some of the random
quantities that belong to this process and will ignore the rest, for the brevity of the presentation.
Hence when we say that Algorithm 2.1 generates the stochastic process $\{X^k, \mathcal{A}_k\}$, this means
we want to focus on the properties of these random variables, but keeping in mind that there
are other random quantities in this stochastic process.

We will derive complexity bounds for each algorithm in the following sense. We will define 
the accuracy goal that we aim to reach and then we will bound the expected number of steps 
that the algorithm takes until this goal is achieved. The analyses will follow common steps, 
and the main ingredients are described below. We then apply these steps to each case under 
consideration. 

2.3 Elements of global convergence rate analysis 

First we recall a standard notion from stochastic processes.

Footnote 1: Note that throughout, $f(x^k) \ne f_k$, since $f_k$ is a related measure of progress towards optimality.

Hitting time. For a given discrete time stochastic process, $Z_t$, recall the concept of a hitting
time for an event $\{Z_t \in S\}$. This is a random variable, defined as $T_S = \min\{t : Z_t \in S\}$, the
first time the event $\{Z_t \in S\}$ occurs. In our context, the set $S$ will either be a set of real numbers
larger than some given value, or smaller than some other given value.

Number of iterations $N_\epsilon$ to reach $\epsilon$ accuracy. Given a level of accuracy $\epsilon$, we aim to
derive a bound on the expected number of iterations $E(N_\epsilon)$ which occur in the algorithm until
the given accuracy level is reached. The number of iterations $N_\epsilon$ is a random variable, which can
be defined as a hitting time of some stochastic process, dependent on the case under analysis.
In particular,

• If $f(x)$ is not known to be convex, then $N_\epsilon$ is the hitting time for $\{\|\nabla f(X^k)\| \le \epsilon\}$, namely,
the number of steps the algorithm takes until $\|\nabla f(X^k)\| \le \epsilon$ occurs for the first time.

• If $f(x)$ is convex or strongly convex then $N_\epsilon$ is the hitting time for $\{f(X^k) - f^* \le \epsilon\}$,
namely, the number of steps the algorithm takes until $f(X^k) - f^* \le \epsilon$ occurs for the first
time, where $f^* = f(x^*)$ with $x^*$ a global minimizer of $f$.
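
As a small illustration of the hitting time $N_\epsilon$ for one realization, the following sketch scans a sequence and returns the first index at which a target event occurs. The helper name `hitting_time` and the sequence of gradient norms are hypothetical and serve only to make the definition concrete.

```python
from typing import Callable, Iterable, Optional

def hitting_time(process: Iterable[float],
                 in_set: Callable[[float], bool]) -> Optional[int]:
    """Return min{t : Z_t in S} for one realization, or None if S is never hit."""
    for t, z in enumerate(process):
        if in_set(z):
            return t
    return None

# Hypothetical realization of ||grad f(x^k)|| for k = 0, 1, ...
grad_norms = [3.0, 1.2, 0.4, 0.09, 0.2, 0.01]
eps = 0.1

# N_eps in the nonconvex case: first k with ||grad f(x^k)|| <= eps.
n_eps = hitting_time(grad_norms, lambda v: v <= eps)
```

Note that the hitting time records the first occurrence only; the gradient norm may rise above $\epsilon$ again afterwards, as it does in this made-up sequence.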

We will bound $E(N_\epsilon)$ by observing that for all $k < N_\epsilon$ the stochastic process induced by
Algorithm 2.1 behaves in a certain way. To formalize this, we need to define the following
random variable and its upper bound.

Measure of progress towards optimality, $F_k$. This measure is defined by the total function
decrease or by the distance to the optimum. In particular,

• If $f(x)$ is not known to be convex, then $F_k = f(X^0) - f(X^k)$.

• If $f(x)$ is convex, then $F_k = 1/(f(X^k) - f^*)$.

• If $f(x)$ is strongly convex, then $F_k = \log(1/(f(X^k) - f^*))$.

Upper bound $F_\epsilon$ on $F_k$. From the algorithm construction, $F_k$ defined above is always
nondecreasing and there exists a deterministic upper bound $F_\epsilon$ in each case, defined as follows.

• If $f(x)$ is not known to be convex, then $F_\epsilon = f(X^0) - f^*$, where $f^*$ is a global lower bound
on $f$.

• If $f(x)$ is convex, then $F_\epsilon = 1/\epsilon$.

• If $f(x)$ is strongly convex, then $F_\epsilon = \log(1/\epsilon)$.

We observe that $F_k$ is a nondecreasing process and $F_\epsilon$ is the largest possible value that $F_k$
can achieve.
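
The three progress measures and their upper bounds can be written down directly. The sketch below is illustrative only: the function names, the string labels for the three cases and the sample numbers are assumptions introduced for the example.

```python
import math

def progress_measure(case: str, f_x0: float, f_xk: float, f_star: float) -> float:
    """F_k for the three cases; f_star is f^* (a lower bound in the nonconvex case)."""
    if case == "nonconvex":
        return f_x0 - f_xk
    if case == "convex":
        return 1.0 / (f_xk - f_star)
    if case == "strongly_convex":
        return math.log(1.0 / (f_xk - f_star))
    raise ValueError(f"unknown case: {case}")

def progress_upper_bound(case: str, f_x0: float, f_star: float, eps: float) -> float:
    """F_eps, the deterministic upper bound on F_k for k < N_eps."""
    if case == "nonconvex":
        return f_x0 - f_star
    if case == "convex":
        return 1.0 / eps
    if case == "strongly_convex":
        return math.log(1.0 / eps)
    raise ValueError(f"unknown case: {case}")

# Hypothetical values: the target accuracy has not been reached yet (f_xk - f_star > eps).
f_x0, f_xk, f_star, eps = 10.0, 0.5, 0.0, 0.01
bounds_hold = all(
    progress_measure(c, f_x0, f_xk, f_star) <= progress_upper_bound(c, f_x0, f_star, eps)
    for c in ("nonconvex", "convex", "strongly_convex")
)
```

In the convex and strongly convex cases the bound $F_k \le F_\epsilon$ is just a restatement of $f(X^k) - f^* > \epsilon$, which holds for all $k < N_\epsilon$ by the definition of the hitting time.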

Our analysis will be based on the following observations, which are borrowed from the global
rate analysis of the deterministic methods [15].

• Guaranteed amount of increase in $f_k$. For all $k < N_\epsilon$ (i.e., until the desired accuracy
has been reached), if the $k$th iteration is true and successful, then $f_k$ is increased by an
amount proportional to $\alpha_k$.

• Guaranteed threshold for $\alpha_k$. There exists a constant, which we will call $C$, such that
if $\alpha_k \le C$ and the $k$th iteration is true, then the $k$th iteration is also successful, and hence
$\alpha_{k+1} = \gamma^{-1}\alpha_k$. This constant $C$ depends on the algorithm and Lipschitz constants of $f$.

• Bound on the number of iterations. If all iterations were true, then by the above
observations, $\alpha_k \ge \gamma C$ and, hence, $f_k$ increases by at least a constant for all $k$. From this
a bound on the number of iterations can be derived, knowing that $f_k$ cannot exceed $F_\epsilon$.

In our case not all iterations are true; however, under the assumption that they "tend" to be
true, as we will show, when $\mathcal{A}_k \le C$, then iterations "tend" to be successful, $\mathcal{A}_k$ "tends" to
stay near the value $C$ and the values $F_k$ "tend" to increase by a constant. The analysis is then
performed via a study of stochastic processes, which we describe in detail next.


2.4 Analysis of the stochastic processes 

Let us consider the stochastic process $\{\mathcal{A}_k, F_k\}$ generated by Algorithm 2.1 using random,
$p$-probabilistically sufficiently accurate models $M_k$, with $F_k$ defined above. Under the assumption
that the sequence of models $M_k$ is $p$-probabilistically sufficiently accurate, each iteration is
true with probability at least $p$, conditioned on the past.

We assume now (and we show later for each specific case) that $\{\mathcal{A}_k, F_k\}$ obeys the following
rules for all $k < N_\epsilon$.

Assumption 2.1 There exist a constant $C > 0$ and a nondecreasing function $h(\alpha)$, $\alpha \in \mathbb{R}$,
which satisfies $h(\alpha) > 0$ for any $\alpha > 0$, such that for any realization of Algorithm 2.1 the
following hold for all $k < N_\epsilon$:

(i) If iteration $k$ is true (i.e. $I_k = 1$) and successful, then $f_{k+1} \ge f_k + h(\alpha_k)$.

(ii) If $\alpha_k \le C$ and iteration $k$ is true then iteration $k$ is also successful, which implies
$\alpha_{k+1} = \gamma^{-1}\alpha_k$.

(iii) $f_{k+1} \ge f_k$ for all $k$.

For future use let us state an auxiliary lemma. 


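Lemma 2.1 Let $N_\epsilon$ be the hitting time as defined in Section 2.3. For all $k < N_\epsilon$, let $I_k$ be the
sequence of random variables in Definition 2.1 so that (1) holds. Let $W_k$ be a nonnegative
stochastic process such that $\sigma(W_k) \subset \mathcal{F}^M_{k-1}$ for any $k \ge 0$. Then

$$E\left(\sum_{k=0}^{N_\epsilon-1} W_k I_k\right) \ge p\, E\left(\sum_{k=0}^{N_\epsilon-1} W_k\right).$$

Similarly,

$$E\left(\sum_{k=0}^{N_\epsilon-1} W_k (1 - I_k)\right) \le (1-p)\, E\left(\sum_{k=0}^{N_\epsilon-1} W_k\right).$$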


Proof. The proof is a simple consequence of properties of expectations, see for example, p2l 
property H*, page 216], 

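$$E(I_k \mid W_k) = E\big(E(I_k \mid \mathcal{F}^M_{k-1}) \mid W_k\big) \ge E(p \mid W_k) = p,$$

where we also used that $\sigma(W_k) \subset \mathcal{F}^M_{k-1}$. Hence by the law of total expectation, we have
$E(W_k I_k) = E(W_k E(I_k \mid W_k)) \ge p E(W_k)$. Similarly, we can derive $E(\mathbb{1}\{k < N_\epsilon\} W_k I_k) \ge
p E(\mathbb{1}\{k < N_\epsilon\} W_k)$, because $\mathbb{1}\{k < N_\epsilon\}$ is also determined by $\mathcal{F}^M_{k-1}$. Finally,

$$E\left(\sum_{k=0}^{N_\epsilon-1} W_k I_k\right) = E\left(\sum_{k=0}^{\infty} \mathbb{1}\{k < N_\epsilon\} W_k I_k\right) \ge p\, E\left(\sum_{k=0}^{\infty} \mathbb{1}\{k < N_\epsilon\} W_k\right) = p\, E\left(\sum_{k=0}^{N_\epsilon-1} W_k\right).$$

The second inequality is proved analogously.

□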


Let us now define two indicator random variables, in addition to $I_k$ defined earlier,

$$\Lambda_k = \mathbb{1}\{\mathcal{A}_k > C\},$$

and

$$\Theta_k = \mathbb{1}\{\text{Iteration } k \text{ is successful, i.e., } \mathcal{A}_{k+1} = \gamma^{-1}\mathcal{A}_k\}.$$

Note that $\sigma(\Lambda_k) \subset \mathcal{F}^M_{k-1}$ and $\sigma(\Theta_k) \subset \mathcal{F}^M_k$, that is, the random variable $\Lambda_k$ is fully determined
by the first $k - 1$ steps of the algorithm, while $\Theta_k$ is fully determined by the first $k$ steps. We
will use $\lambda_k$, $i_k$ and $\theta_k$ to denote realizations of $\Lambda_k$, $I_k$ and $\Theta_k$, respectively.

These indicators will help us define our algorithm more rigorously as a stochastic process.
Without loss of generality, we assume that $C = \gamma^c \alpha_0 < \gamma\alpha_{\max}$ for some positive integer $c$. In
other words, $C$ is the largest value that the step size $\mathcal{A}_k$ actually achieves for which part (ii)
of Assumption 2.1 holds. The condition $C < \gamma\alpha_{\max}$ is a simple technical condition, which is
not necessary, but which simplifies the presentation later in this section. Under Assumption
2.1, recalling the update rules for $\alpha_k$ in Algorithm 2.1 and the assumption that true iterations
occur with probability at least $p$, we can write the stochastic process $\{\mathcal{A}_k, F_k\}$ as obeying the
expressions below:




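$$\mathcal{A}_{k+1} = \begin{cases}
\gamma^{-1}\mathcal{A}_k & \text{if } I_k = 1 \text{ and } \Lambda_k = 0, \\
\gamma\mathcal{A}_k & \text{if } I_k = 0 \text{ and } \Lambda_k = 0, \\
\min\{\alpha_{\max}, \gamma^{-1}\mathcal{A}_k\} & \text{if } \Theta_k = 1 \text{ and } \Lambda_k = 1, \\
\gamma\mathcal{A}_k & \text{if } \Theta_k = 0 \text{ and } \Lambda_k = 1,
\end{cases} \tag{2}$$

$$F_{k+1} \ge \begin{cases}
F_k + h(\mathcal{A}_k) & \text{if } I_k = 1 \text{ and } \Lambda_k = 0, \\
F_k & \text{if } I_k = 0 \text{ and } \Lambda_k = 0, \\
F_k + h(\mathcal{A}_k) & \text{if } \Theta_k I_k = 1 \text{ and } \Lambda_k = 1, \\
F_k & \text{if } \Theta_k I_k = 0 \text{ and } \Lambda_k = 1.
\end{cases} \tag{3}$$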


We conclude that, when $\mathcal{A}_k \le C$, a successful iteration happens with probability at least $p$,
and in that case $\mathcal{A}_{k+1} = \gamma^{-1}\mathcal{A}_k$, and that an unsuccessful iteration happens with probability
at most $1 - p$, in which case $\mathcal{A}_{k+1} = \gamma\mathcal{A}_k$. Note that there is no known probability bound for
the different outcomes when $\mathcal{A}_k > C$. However, we know that $I_k = 1$ with probability at least $p$
and if, in addition, iteration $k$ happens to be successful, then $F_k$ is increased by at least $h(\mathcal{A}_k)$.

In summary, from the above discussion, we have

for all $k < N_\epsilon$, Algorithm 2.1 under Assumption 2.1 yields
the stochastic process $\{\mathcal{A}_k, F_k\}$ in (2) and (3).

2.5 Bounding the number of steps for which $\alpha_k \le C$

In this subsection we derive a bound on $E\left(\sum_{k=0}^{N_\epsilon-1}(1 - \Lambda_k)\right)$. The bound for $E\left(\sum_{k=0}^{N_\epsilon-1}\Lambda_k\right)$ will
be derived in the next section.

The following simple result holds for every realization of the algorithm and stochastic process
$\{\mathcal{A}_k, I_k, \Theta_k\}$.

Lemma 2.2 For any $l \in \{0, \ldots, N_\epsilon - 1\}$ and for all realizations of Algorithm 2.1, we have

$$\sum_{k=0}^{l} (1 - \Lambda_k)\Theta_k \le \frac{1}{2}(l + 1).$$

Proof. By the definition of $\Lambda_k$ and $\Theta_k$ we know that when $(1 - \Lambda_k)\Theta_k = 1$ then we have
a successful iteration and $\mathcal{A}_k \le C$. In this case $\mathcal{A}_{k+1} = \gamma^{-1}\mathcal{A}_k$. It follows that amongst all
iterations, at most half can be successful and have $\mathcal{A}_k \le C$, because for each such iteration,
when $\mathcal{A}_k$ gets increased by a factor of $\gamma^{-1}$, there has to be at least one iteration when $\mathcal{A}_k$ is
decreased by the same factor, since $\alpha_0 > C$. □
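
The counting argument of Lemma 2.2 holds path by path, which makes it easy to check by simulation. The sketch below is an illustration under assumptions that are not part of the analysis: step sizes are tracked through the integer exponent $j$ in $\mathcal{A}_k = \alpha_0\gamma^{j}$ (so $\mathcal{A}_k \le C = \gamma^c\alpha_0$ exactly when $j \ge c$), false iterations with $\mathcal{A}_k \le C$ are taken to be unsuccessful, iterations with $\mathcal{A}_k > C$ succeed with a hypothetical probability `q_above`, and $\alpha_{\max} = \alpha_0\gamma^{j_{\min}}$.

```python
import random

def simulate_step_sizes(p, c, n_iter, q_above=0.5, j_min=-2, seed=0):
    """Simulate one realization; returns a list of (at_or_below_C, successful)."""
    rng = random.Random(seed)
    j = 0                                   # A_0 = alpha_0 > C since c >= 1
    records = []
    for _ in range(n_iter):
        at_or_below = j >= c                # indicator 1 - Lambda_k
        if at_or_below:
            success = rng.random() < p      # true iteration => successful (Assumption 2.1(ii))
        else:
            success = rng.random() < q_above  # no probability bound is known here (assumed value)
        records.append((at_or_below, success))
        # success: multiply by gamma^{-1} (capped at alpha_max); failure: multiply by gamma
        j = max(j_min, j - 1) if success else j + 1
    return records

def lemma_2_2_holds(records):
    """Check sum_{k<=l} (1 - Lambda_k) Theta_k <= (l + 1) / 2 for every prefix l."""
    count = 0
    for l, (at_or_below, success) in enumerate(records):
        count += int(at_or_below and success)
        if 2 * count > l + 1:
            return False
    return True

records = simulate_step_sizes(p=0.8, c=3, n_iter=2000)
holds = lemma_2_2_holds(records)
```

Because the bound is deterministic, it should hold for any seed and any choice of `p` and `q_above` in this model; only the later expectation bounds rely on $p$.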


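Using this we derive the bound.

Lemma 2.3

$$E\left(\sum_{k=0}^{N_\epsilon-1} (1 - \Lambda_k)\right) \le \frac{1}{2p} E(N_\epsilon).$$

Proof. By Lemma 2.1 applied to $W_k = 1 - \Lambda_k$ we have

$$E\left(\sum_{k=0}^{N_\epsilon-1} (1 - \Lambda_k) I_k\right) \ge p\, E\left(\sum_{k=0}^{N_\epsilon-1} (1 - \Lambda_k)\right). \tag{4}$$

From the fact that all true iterations are successful when $\alpha_k \le C$,

$$\sum_{k=0}^{N_\epsilon-1} (1 - \Lambda_k) I_k \le \sum_{k=0}^{N_\epsilon-1} (1 - \Lambda_k)\Theta_k. \tag{5}$$

Finally, from Lemma 2.2,

$$\sum_{k=0}^{N_\epsilon-1} (1 - \Lambda_k)\Theta_k \le \frac{1}{2} N_\epsilon. \tag{6}$$

Taking expectations in (5) and (6) and combining with (4), we obtain the result of the
lemma. □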


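2.6 Bounding the expected number of steps for which $\alpha_k > C$

Let us now consider the bound on $E\left(\sum_{k=0}^{N_\epsilon-1}\Lambda_k\right)$. We introduce the additional notation
$\bar\Lambda_k = \mathbb{1}\{\mathcal{A}_k > C\} + \mathbb{1}\{\mathcal{A}_k = C\}$. In other words $\bar\Lambda_k = 1$ when either $\Lambda_k = 1$ or $\mathcal{A}_k = C$. We now
define:

• $N_1 = \sum_{k=0}^{N_\epsilon-1} \bar\Lambda_k (1 - I_k)\Theta_k$, which is the number of false successful iterations, when $\mathcal{A}_k \ge C$.

• $M_1 = \sum_{k=0}^{N_\epsilon-1} \bar\Lambda_k (1 - I_k)$, which is the number of false iterations, when $\mathcal{A}_k \ge C$.

• $N_2 = \sum_{k=0}^{N_\epsilon-1} \bar\Lambda_k I_k \Theta_k$, which is the number of true successful iterations, when $\mathcal{A}_k \ge C$.

• $M_2 = \sum_{k=0}^{N_\epsilon-1} \bar\Lambda_k I_k$, which is the number of true iterations, when $\mathcal{A}_k \ge C$.

• $N_3 = \sum_{k=0}^{N_\epsilon-1} \Lambda_k I_k (1 - \Theta_k)$, which is the number of true unsuccessful iterations, when
$\mathcal{A}_k > C$.

• $M_3 = \sum_{k=0}^{N_\epsilon-1} \Lambda_k (1 - \Theta_k)$, which is the number of unsuccessful iterations, when $\mathcal{A}_k > C$.

Since $E\left(\sum_{k=0}^{N_\epsilon-1}\Lambda_k\right) = E\left(\sum_{k=0}^{N_\epsilon-1}\Lambda_k(1 - I_k)\right) + E\left(\sum_{k=0}^{N_\epsilon-1}\Lambda_k I_k\right) \le E(M_1) + E(M_2)$, our goal
is to bound $E(M_1) + E(M_2)$.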

Our next observation is simple but central in our analysis. It reflects the fact that the gain
in $F_k$ is bounded from above by $F_\epsilon$ and when $\mathcal{A}_k \ge C$ this gain is bounded from below as well,
hence allowing us to bound the total number of true successful iterations when $\mathcal{A}_k \ge C$. The
following two lemmas hold for every realization.

Lemma 2.4 For any $l \in \{0, \ldots, N_\epsilon - 1\}$ and for all realizations of Algorithm 2.1, we have

$$\sum_{k=0}^{l} \bar\Lambda_k I_k \Theta_k \le \frac{F_\epsilon}{h(C)},$$

and so

$$N_2 \le \frac{F_\epsilon}{h(C)}. \tag{7}$$

Proof. Consider any $k$ for which $\bar\Lambda_k I_k \Theta_k = 1$. From Assumption 2.1 we know that whenever
an iteration is true and successful then $F_k$ gets increased by at least $h(\mathcal{A}_k) \ge h(C)$, since $\mathcal{A}_k \ge C$
and $h$ is nondecreasing. We also know that on other iterations $F_k$ does not decrease. The bound
$F_k \le F_\epsilon$ trivially gives us the desired result. □


Another key observation is that

$$M_2 \le N_2 + N_3 \le N_2 + M_3, \tag{8}$$

where the first inequality follows from the fact that for all $k < N_\epsilon$ and for all realizations,
$(\bar\Lambda_k - \Lambda_k) I_k (1 - \Theta_k) = 0$; in other words there are no true unsuccessful iterations when $\mathcal{A}_k = C$.

Lemma 2.5 For any $l \in \{0, \ldots, N_\epsilon - 1\}$ and for all realizations of Algorithm 2.1, we have

$$\sum_{k=0}^{l} \Lambda_k (1 - \Theta_k) \le \sum_{k=0}^{l} \bar\Lambda_k \Theta_k + \log_\gamma\left(\frac{C}{\alpha_0}\right).$$

Proof. $\mathcal{A}_k$ is increased on successful iterations and decreased on unsuccessful ones. Hence the
total number of steps when $\mathcal{A}_k > C$ and $\mathcal{A}_k$ is decreased is bounded by the total number of
steps when $\mathcal{A}_k \ge C$ is increased plus the number of steps required to reduce $\mathcal{A}_k$ from its
initial value $\alpha_0$ to $C$. □


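From Lemma 2.5 applied to $l = N_\epsilon - 1$, we can deduce that

$$M_3 \le N_1 + N_2 + \log_\gamma(C/\alpha_0). \tag{9}$$

We also have the following lemma.

Lemma 2.6

$$E(M_1) \le \frac{1-p}{p} E(M_2).$$

Proof. By applying both inequalities in Lemma 2.1 with $W_k = \bar\Lambda_k$, we obtain

$$E\left(\sum_{k=0}^{N_\epsilon-1} \bar\Lambda_k I_k\right) \ge p\, E\left(\sum_{k=0}^{N_\epsilon-1} \bar\Lambda_k\right)$$

and

$$E\left(\sum_{k=0}^{N_\epsilon-1} \bar\Lambda_k (1 - I_k)\right) \le (1-p)\, E\left(\sum_{k=0}^{N_\epsilon-1} \bar\Lambda_k\right),$$

which gives us

$$E\left(\sum_{k=0}^{N_\epsilon-1} \bar\Lambda_k (1 - I_k)\right) \le \frac{1-p}{p} E\left(\sum_{k=0}^{N_\epsilon-1} \bar\Lambda_k I_k\right). \tag{10}$$

□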


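Lemma 2.7 Under the condition that $p > 1/2$, we have

$$E\left(\sum_{k=0}^{N_\epsilon-1} \Lambda_k\right) \le \frac{2F_\epsilon}{h(C)(2p-1)} + \frac{\log_\gamma(C/\alpha_0)}{2p-1}.$$

Proof. Recall that $E\left(\sum_{k=0}^{N_\epsilon-1}\Lambda_k\right) \le E(M_1 + M_2)$. Using (8) and (10) it follows that

$$E(N_1) \le E(M_1) \le \frac{1-p}{p} E(M_2) \le \frac{1-p}{p} E(N_2 + M_3) = \frac{1-p}{p}\left[E(N_2) + E(M_3)\right]. \tag{11}$$

Taking into account (9) and using the bound (7) on $N_2$ we have

$$E(M_3) \le E(N_1) + E(N_2) + \log_\gamma(C/\alpha_0) \le E(N_1) + \frac{F_\epsilon}{h(C)} + \log_\gamma(C/\alpha_0). \tag{12}$$

Plugging this into (11) and using the bound (7) on $N_2$ again, we obtain

$$E(N_1) \le \frac{1-p}{p}\left[\frac{F_\epsilon}{h(C)} + E(N_1) + \frac{F_\epsilon}{h(C)} + \log_\gamma\left(\frac{C}{\alpha_0}\right)\right]$$

and, hence,

$$\frac{2p-1}{p} E(N_1) \le \frac{1-p}{p}\left[\frac{2F_\epsilon}{h(C)} + \log_\gamma\left(\frac{C}{\alpha_0}\right)\right].$$

This finally implies

$$E(N_1) \le \frac{1-p}{2p-1}\left[\frac{2F_\epsilon}{h(C)} + \log_\gamma\left(\frac{C}{\alpha_0}\right)\right]. \tag{13}$$

Now we can bound the expected total number of iterations when $\alpha_k > C$, using the inequalities
above together with (13) and adding the terms to obtain the result of the lemma, namely,

$$E(M_1 + M_2) \le E(M_1 + M_3 + N_2) \le \frac{1}{p} E(M_3 + N_2) \le \frac{1}{2p-1}\left(\frac{2F_\epsilon}{h(C)} + \log_\gamma\left(\frac{C}{\alpha_0}\right)\right).$$

□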


2.7 Final bound on the expected stopping time 

We finally have the following theorem which trivially follows from Lemmas 2.3 and 2.7.

Theorem 2.1 Under the condition that $p > 1/2$, the hitting time $N_\epsilon$ is bounded in expectation
as follows

$$E(N_\epsilon) \le \frac{2p}{(2p-1)^2}\left(\frac{2F_\epsilon}{h(C)} + \log_\gamma\left(\frac{C}{\alpha_0}\right)\right).$$

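Proof. Clearly

$$E(N_\epsilon) = E\left(\sum_{k=0}^{N_\epsilon-1} \Lambda_k\right) + E\left(\sum_{k=0}^{N_\epsilon-1} (1 - \Lambda_k)\right)$$

and, hence, using Lemmas 2.3 and 2.7 we have

$$E(N_\epsilon) \le \frac{1}{2p} E(N_\epsilon) + \frac{1}{2p-1}\left(\frac{2F_\epsilon}{h(C)} + \log_\gamma\left(\frac{C}{\alpha_0}\right)\right).$$

The result of the theorem easily follows.

□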
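
To get a feel for how the bound of Theorem 2.1 depends on $p$, it can be evaluated numerically. The sketch below is purely illustrative: the function name and the sample values of $F_\epsilon$, $h(C)$, $C$, $\alpha_0$ and $\gamma$ are assumptions chosen for the example, not quantities from any specific algorithm.

```python
import math

def expected_iterations_bound(p, F_eps, h_C, C, alpha0, gamma):
    """Right-hand side of the Theorem 2.1 bound on E(N_eps); requires 1/2 < p <= 1."""
    if not 0.5 < p <= 1.0:
        raise ValueError("the bound requires 1/2 < p <= 1")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    log_term = math.log(C / alpha0) / math.log(gamma)   # log_gamma(C / alpha_0)
    return 2.0 * p / (2.0 * p - 1.0) ** 2 * (2.0 * F_eps / h_C + log_term)

# Hypothetical values with C = gamma^3 * alpha_0, so that log_gamma(C / alpha_0) = 3.
params = dict(F_eps=10.0, h_C=0.5, C=0.125, alpha0=1.0, gamma=0.5)
bound_always_true = expected_iterations_bound(p=1.0, **params)
bound_p_075 = expected_iterations_bound(p=0.75, **params)
```

The factor $2p/(2p-1)^2$ equals 2 at $p = 1$ and 6 at $p = 0.75$, and it grows without bound as $p$ approaches $1/2$, while the dependence on $F_\epsilon$ (and hence on $\epsilon$) is unchanged.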


Summary of our complexity analysis framework. We have considered a(ny) algorithm
in the framework Algorithm 2.1 with probabilistically sufficiently accurate models as in
Definition 2.1. We have developed a methodology to obtain (complexity) bounds on the number of
iterations $N_\epsilon$ that such an algorithm takes to reach desired accuracy. It is important to note
that, while we simply provide the bound on $E(N_\epsilon)$, it is easy to extend the analysis of the same
stochastic processes to provide bounds on $P\{N_\epsilon > K\}$, for any $K$ larger than the bound on
$E(N_\epsilon)$; in particular it can be shown that $P\{N_\epsilon > K\}$ decays exponentially with $K$.

While in our analysis we assumed that the constant $\gamma$ by which we decrease and increase $\alpha_k$
is the same, our analysis can be quite easily extended to the case when the constants for increase
and decrease are different, say $\gamma_{inc}$ and $\gamma_{dec}$. In this case the threshold on the probability $p$ may
no longer be 1/2 but will be larger if $\gamma_{inc}\gamma_{dec} < 1$ and smaller otherwise. Some of the constants
in the upper bound on $E(N_\epsilon)$ will change accordingly.

Our approach is valid provided that all of the conditions in Assumption 2.1 hold. Next
we show that all these conditions are satisfied by steepest-descent linesearch methods in the
nonconvex, convex and strongly convex case; by general linesearch methods in the nonconvex
case; by cubic regularization methods (ARC) for nonconvex objectives. In particular, we will
specify what we mean by a probabilistically sufficiently accurate first-order and second-order
model in the case of linesearch and cubic regularization methods, respectively.

3 The line-search algorithm 

We will now apply the generic analysis outlined in the previous section to the case of the following 
simple probabilistic line-search algorithm. 

Algorithm 3.1 A line-search algorithm with random models

Initialization

Choose constants $\gamma \in (0,1)$, $\theta \in (0,1)$ and $\alpha_{\max} > 0$. Pick initial $x^0$ and $\alpha_0 < \alpha_{\max}$.
Repeat for $k = 0, 1, \ldots$

1. Compute a model and a step

Compute a random model $m_k$ and use it to generate a direction $g^k$.

Set the step $s^k = -\alpha_k g^k$.

2. Check sufficient decrease

Check if

$$f(x^k - \alpha_k g^k) \le f(x^k) - \alpha_k\theta\|g^k\|^2. \tag{14}$$

3. Successful step

If (14) holds, then $x^{k+1} := x^k - \alpha_k g^k$ and $\alpha_{k+1} = \min\{\alpha_{\max}, \gamma^{-1}\alpha_k\}$. Let $k := k + 1$.

4. Unsuccessful step

Otherwise, $x^{k+1} := x^k$, set $\alpha_{k+1} = \gamma\alpha_k$. Let $k := k + 1$.
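The four steps of Algorithm 3.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the oracle `make_unreliable_gradient` is hypothetical (it returns the exact gradient with probability `p` and its negative otherwise), the quadratic test function is chosen for convenience, and the loop runs for a fixed number of iterations rather than testing a stopping criterion.

```python
import random


def line_search_random_models(f, direction, x0, alpha0, alpha_max, gamma, theta, max_iter):
    """Sketch of Algorithm 3.1; direction(x, alpha) plays the role of the random g^k."""
    x, alpha = list(x0), alpha0
    successes = 0
    for _ in range(max_iter):
        g = direction(x, alpha)                               # step 1: random direction
        trial = [xi - alpha * gi for xi, gi in zip(x, g)]     # x^k - alpha_k g^k
        g_norm_sq = sum(gi * gi for gi in g)
        if f(trial) <= f(x) - alpha * theta * g_norm_sq:      # step 2: condition (14)
            x = trial                                         # step 3: successful step
            alpha = min(alpha_max, alpha / gamma)
            successes += 1
        else:
            alpha = gamma * alpha                             # step 4: unsuccessful step
    return x, alpha, successes


def make_unreliable_gradient(grad, p, rng):
    """Hypothetical oracle: exact gradient with probability p, its negative otherwise."""
    def direction(x, alpha):
        g = grad(x)
        return g if rng.random() < p else [-gi for gi in g]
    return direction


def quadratic(x):
    return 0.5 * sum(xi * xi for xi in x)


def quadratic_grad(x):
    return list(x)


oracle = make_unreliable_gradient(quadratic_grad, p=0.8, rng=random.Random(0))
x_final, alpha_final, successes = line_search_random_models(
    quadratic, oracle, x0=[3.0, -4.0], alpha0=0.5, alpha_max=0.9,
    gamma=0.5, theta=0.1, max_iter=200)
print(quadratic(x_final), alpha_final, successes)
```

Because a step is accepted only when (14) holds, the objective never increases, even on iterations where the oracle returns a bad direction; those iterations merely shrink the step size.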

For the line-search algorithm, the key ingredient is the selection of a search direction on each
iteration. In our case we assume that the search direction is random and satisfies some accuracy
requirement that we discuss below. The choice of model in this algorithm is a simple linear model
$m_k(x)$, which gives rise to the search direction $g^k$, specifically, $m_k(x) = f(x^k) + (x - x^k)^T g^k$.
We will consider more general models in the next section, Section 3.2.

Recall Definition 2.1. Here we describe the specific requirement we apply to the models in
the case of line search.

Definition 3.1 We say that a sequence of random models and corresponding directions $\{M_k, G^k\}$
is (p)-probabilistically "sufficiently accurate" for Algorithm 3.1 for a corresponding sequence
$\{A_k, X^k\}$, if there exists a constant $\kappa > 0$, such that the indicator variables

$$I_k = \mathbb{1}\{\|G^k - \nabla f(X^k)\| \le \kappa A_k\|G^k\|\}$$

satisfy the following submartingale-like condition

$$P(I_k = 1 \mid \mathcal{F}^M_{k-1}) \ge p,$$

where $\mathcal{F}^M_{k-1} = \sigma(M_0, \ldots, M_{k-1})$ is the $\sigma$-algebra generated by $M_0, \ldots, M_{k-1}$.




As before, each iteration for which $I_k = 1$ holds is called a true iteration. It follows that for
every realization of the algorithm, on all true iterations, we have

$$\|g^k - \nabla f(x^k)\| \le \kappa\alpha_k\|g^k\|, \tag{15}$$

which implies, using $\alpha_k \le \alpha_{\max}$ and the triangle inequality, that

$$\|g^k\| \ge \frac{\|\nabla f(x^k)\|}{1+\kappa\alpha_{\max}}. \tag{16}$$

For the remainder of the analysis of Algorithm 3.1 we make the following assumption.

Assumption 3.1 The sequence of random models and corresponding directions $\{M_k, G^k\}$, generated
in Algorithm 3.1, is (p)-probabilistically "sufficiently accurate" for the corresponding random
sequence $\{A_k, X^k\}$, with $p > 1/2$.

We also make a standard assumption on the smoothness of $f(x)$ for the remainder of the
paper.

Assumption 3.2 $f \in C^1(\mathbb{R}^n)$, is globally bounded below by $f_*$, and has globally Lipschitz
continuous gradient $\nabla f$, namely,

$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\| \quad \text{for all } x, y \in \mathbb{R}^n \text{ and some } L > 0. \tag{17}$$


3.1 The nonconvex case, steepest descent 

As mentioned before, our goal in the nonconvex case is to compute a bound on the expected
number of iterations $k$ that Algorithm 3.1 requires to obtain an iterate $x^k$ for which
$\|\nabla f(x^k)\| \le \epsilon$. We will now compute the specific quantities and expressions defined in
Sections 2.3 and 2.4 that allow us to apply the analysis of our general framework to the specific
case of Algorithm 3.1 for nonconvex functions.

Let $N_\epsilon$ denote, as before, the number of iterations that are taken until $\|\nabla f(X^k)\| \le \epsilon$
occurs (which is a random variable). Let us consider the stochastic process $\{A_k, F_k\}$, with
$F_k = f(x^0) - f(X^k)$ and let $F_\epsilon = f(x^0) - f_*$. Then $F_k \le F_\epsilon$, for all $k$.

Next we show that Assumption 2.1 is verified. First we derive an expression for the constant
$C$, related to the size of the stepsize $\alpha_k$.


Lemma 3.1 Let Assumption 3.2 hold. For every realization of Algorithm 3.1, if iteration $k$ is
true (i.e. $I_k = 1$), and if

$$\alpha_k \le C = \frac{1-\theta}{0.5L+\kappa}, \tag{18}$$

then (14) holds. In other words, when (18) holds, any true iteration is also a successful one.


Proof. Condition (17) implies the following overestimation property for all $x$ and $s$ in $\mathbb{R}^n$,

$$f(x+s) \le f(x) + s^T\nabla f(x) + \frac{L}{2}\|s\|^2,$$

which implies

$$f(x^k - \alpha_k g^k) \le f(x^k) - \alpha_k(g^k)^T\nabla f(x^k) + \frac{L}{2}\alpha_k^2\|g^k\|^2.$$

Applying the Cauchy-Schwarz inequality and (15) we have

$$\begin{aligned}
f(x^k - \alpha_k g^k) &\le f(x^k) - \alpha_k(g^k)^T[\nabla f(x^k) - g^k] - \alpha_k\|g^k\|^2\left[1 - \tfrac{L}{2}\alpha_k\right] \\
&\le f(x^k) + \alpha_k\|g^k\|\cdot\|\nabla f(x^k) - g^k\| - \alpha_k\|g^k\|^2\left[1 - \tfrac{L}{2}\alpha_k\right] \\
&\le f(x^k) - \alpha_k\|g^k\|^2\left[1 - \left(\kappa + \tfrac{L}{2}\right)\alpha_k\right].
\end{aligned}$$

It follows that (14) holds whenever $f(x^k) - \alpha_k\|g^k\|^2[1 - (\kappa + 0.5L)\alpha_k] \le f(x^k) - \alpha_k\theta\|g^k\|^2$, which
is equivalent to (18). □
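Lemma 3.1 can be checked numerically on a simple example. The sketch below is an illustration only: it assumes the one-dimensional quadratic $f(x) = \tfrac{L}{2}x^2$ (whose gradient is $L$-Lipschitz), builds directions $g$ that satisfy the accuracy condition (15) by construction, and counts how often the sufficient decrease condition (14) fails for step sizes below the threshold $C$ in (18). All function names are hypothetical.

```python
import random


def lemma31_threshold(L, kappa, theta):
    """The constant C from (18)."""
    return (1.0 - theta) / (0.5 * L + kappa)


def sufficient_decrease(f, x, g, alpha, theta):
    """Condition (14) in one dimension."""
    return f(x - alpha * g) <= f(x) - alpha * theta * g * g


def count_violations(L, kappa, theta, trials, rng):
    """Sample true iterations with alpha below C and count failures of (14)."""
    def f(x):
        return 0.5 * L * x * x            # gradient L*x is L-Lipschitz

    C = lemma31_threshold(L, kappa, theta)
    violations = 0
    for _ in range(trials):
        x = rng.uniform(-10.0, 10.0)
        # Stay slightly inside (0, C] so rounding errors cannot decide the test.
        alpha = rng.uniform(0.01 * C, 0.999 * C)
        delta = rng.uniform(-kappa * alpha, kappa * alpha)
        # With this choice |g - f'(x)| = |delta| * |g| <= kappa * alpha * |g|, i.e. (15) holds.
        g = L * x / (1.0 + delta)
        if not sufficient_decrease(f, x, g, alpha, theta):
            violations += 1
    return violations


print(count_violations(L=4.0, kappa=1.0, theta=0.2, trials=2000, rng=random.Random(1)))
```

In this example the threshold is tight: taking $\alpha_k = 1.5C$ together with the worst admissible error in (15) makes condition (14) fail, so the bound (18) cannot be relaxed in general.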

From Lemma 3.1 and from (14) and (16), for any realization of Algorithm 3.1 which gives
us the specific sequence $\{\alpha_k, f_k\}$, the following hold.

• If $k$ is a true and successful iteration, then

$$f_{k+1} \ge f_k + \frac{\theta\|\nabla f(x^k)\|^2\alpha_k}{(1+\kappa\alpha_{\max})^2}$$

and

$$\alpha_{k+1} = \gamma^{-1}\alpha_k.$$

• If $\alpha_k \le C$, where $C$ is defined in (18), and iteration $k$ is true, then it is also successful.

Hence, Assumption 2.1 holds and the process $\{A_k, F_k\}$ behaves exactly as our generic process
(2)-(3) in Section 2.3 with $C$ defined in (18) and the specific choice of
$h(A_k) = \frac{\theta\epsilon^2 A_k}{(1+\kappa\alpha_{\max})^2}$ (recall that $\|\nabla f(x^k)\| > \epsilon$ for all $k < N_\epsilon$).

Finally, we use Theorem 2.1 and, substituting the expressions for $C$, $h(C)$ and $F_\epsilon$ into the
bound on $\mathbb{E}(N_\epsilon)$, we obtain the following complexity result.


Theorem 3.1 Let Assumptions 3.1 and 3.2 hold. Then the expected number of iterations that
Algorithm 3.1 takes until $\|\nabla f(X^k)\| \le \epsilon$ occurs is bounded as follows

$$\mathbb{E}(N_\epsilon) \le \frac{2p}{(2p-1)^2}\left[\frac{M}{\epsilon^2} + \log_\gamma\left(\frac{1-\theta}{\alpha_0(0.5L+\kappa)}\right)\right],$$

where $M = \frac{2(f(x^0)-f_*)(1+\kappa\alpha_{\max})^2(0.5L+\kappa)}{\theta(1-\theta)}$, so that $M/\epsilon^2 = 2F_\epsilon/h(C)$, is a constant
independent of $p$ and $\epsilon$.


Remark 3.1 We note that the dependency of the expected number of iterations on $\epsilon$ is of the
order $1/\epsilon^2$, as expected from a line-search method applied to a smooth nonconvex problem. The
dependency on $p$ is rather intuitive as well: if $p = 1$, then the deterministic complexity is
recovered, while as $p$ approaches 1/2, the expected number of iterations goes to infinity, since the
models/directions are arbitrarily bad as often as they are good.


3.2 The nonconvex case, general descent 

In this subsection, we explain how the above analysis of the line-search method extends from 
the nonconvex steepest descent case to a general nonconvex descent case. 

In particular, we consider that in Algorithm 3.1 $s^k = \alpha_k d^k$ (instead of $-\alpha_k g^k$), where $d^k$ is
any direction that satisfies the following standard conditions.

• There exists a constant $\beta > 0$, such that

$$\frac{(d^k)^T g^k}{\|d^k\|\cdot\|g^k\|} \le -\beta, \quad \forall k. \tag{19}$$

• There exist constants $\kappa_1, \kappa_2 > 0$, such that

$$\kappa_1\|g^k\| \le \|d^k\| \le \kappa_2\|g^k\|, \quad \forall k. \tag{20}$$

The sufficient decrease condition (14) is replaced by

$$f(x^k + \alpha_k d^k) \le f(x^k) + \alpha_k\theta(d^k)^T g^k. \tag{21}$$

It is easy to show that a simple variant of Lemma 3.1 applies.

Lemma 3.2 Let Assumption 3.2 hold. Consider Algorithm 3.1 with $s^k = \alpha_k d^k$ and sufficient
decrease condition (21). Assume that $d^k$ satisfies (19) and (20). Then, for every realization of
the resulting algorithm, if iteration $k$ is true (i.e. $I_k = 1$), and if

$$\alpha_k \le C = \frac{\beta(1-\theta)}{0.5L\kappa_2+\kappa}, \tag{22}$$

then (21) holds. In other words, when (22) holds, any true iteration is also a successful one.
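Proof. The first displayed equation in the proof of Lemma 3.1 provides

$$f(x^k + \alpha_k d^k) \le f(x^k) + \alpha_k(d^k)^T\nabla f(x^k) + \frac{L}{2}\alpha_k^2\|d^k\|^2.$$

Applying the Cauchy-Schwarz inequality, (15) and the conditions (20) on $d^k$ we have

$$\begin{aligned}
f(x^k + \alpha_k d^k) &\le f(x^k) + \alpha_k(d^k)^T[\nabla f(x^k) - g^k] + \alpha_k(d^k)^T g^k + \tfrac{L}{2}\alpha_k^2\|d^k\|^2 \\
&\le f(x^k) + \alpha_k\|d^k\|\cdot\|\nabla f(x^k) - g^k\| + \alpha_k(d^k)^T g^k + \tfrac{L}{2}\alpha_k^2\|d^k\|^2 \\
&\le f(x^k) + \alpha_k^2\kappa\|d^k\|\|g^k\| + \alpha_k(d^k)^T g^k + \tfrac{L}{2}\alpha_k^2\kappa_2\|d^k\|\|g^k\| \\
&= f(x^k) + \alpha_k(d^k)^T g^k + \alpha_k^2\|d^k\|\|g^k\|\left(\kappa + \kappa_2\tfrac{L}{2}\right).
\end{aligned}$$

It follows that (21) holds whenever

$$\alpha_k(d^k)^T g^k + \alpha_k^2\|d^k\|\|g^k\|\left(\kappa + \kappa_2\tfrac{L}{2}\right) \le \alpha_k\theta(d^k)^T g^k,$$

or equivalently, since $\alpha_k > 0$, whenever

$$\alpha_k\|d^k\|\|g^k\|\left(\kappa + \kappa_2\tfrac{L}{2}\right) \le -(1-\theta)(d^k)^T g^k.$$

Using (19), the latter displayed equation holds whenever $\alpha_k$ satisfies (22).

□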
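As an illustration of conditions (19) and (20), consider a diagonally scaled direction $d^k = -B^{-1}g^k$ with a fixed diagonal positive definite matrix $B$. This choice is an assumption made here for the example and is not discussed in the text; for it, one can take $\beta = \lambda_{\min}/\lambda_{\max}$, $\kappa_1 = 1/\lambda_{\max}$ and $\kappa_2 = 1/\lambda_{\min}$, where $\lambda_{\min}$ and $\lambda_{\max}$ are the extreme diagonal entries of $B$. The sketch below computes these constants and the resulting threshold (22); the function names are hypothetical.

```python
import math
import random


def scaled_direction(g, diag):
    """d = -B^{-1} g for a diagonal positive definite B = diag(diag)."""
    return [-gi / bi for gi, bi in zip(g, diag)]


def direction_constants(diag):
    """Constants (beta, kappa1, kappa2) for which (19) and (20) hold for scaled_direction."""
    lam_min, lam_max = min(diag), max(diag)
    return lam_min / lam_max, 1.0 / lam_max, 1.0 / lam_min


def general_descent_threshold(beta, theta, L, kappa, kappa2):
    """The constant C from (22)."""
    return beta * (1.0 - theta) / (0.5 * L * kappa2 + kappa)


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


diag = [1.0, 2.0, 4.0]
beta, kappa1, kappa2 = direction_constants(diag)
rng = random.Random(2)
g = [rng.uniform(-1.0, 1.0) for _ in diag]
d = scaled_direction(g, diag)
cosine = sum(di * gi for di, gi in zip(d, g)) / (norm(d) * norm(g))
print(cosine <= -beta, kappa1 * norm(g) <= norm(d) <= kappa2 * norm(g))
print(general_descent_threshold(beta, theta=0.2, L=4.0, kappa=1.0, kappa2=kappa2))
```

Compared with (18), the threshold (22) shrinks by the factor $\beta$ and by the scaling $\kappa_2$ of the Lipschitz term, so poorly conditioned scalings lead to smaller guaranteed step sizes.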


We conclude this extension to general descent directions by observing that if $k$ is a true and
successful iteration, using the sufficient decrease condition (21), the conditions (19) and (20) on
$d^k$ and (16), we obtain that

$$f_{k+1} \ge f_k + \frac{\theta\beta\kappa_1\|\nabla f(x^k)\|^2\alpha_k}{(1+\kappa\alpha_{\max})^2}.$$

Hence, Assumption 2.1 holds for this case as well and the remainder of the analysis is exactly
the same as for the steepest descent case.
3.3 The convex case


We now analyze the expected complexity of Algorithm 3.1 in the case when $f(x)$ is a convex
function, that is when the following assumption holds.

Assumption 3.3 $f \in C^1(\mathbb{R}^n)$ is convex and has bounded level sets so that

$$\|x - x^*\| \le D \quad \text{for all } x \text{ with } f(x) \le f(x^0), \tag{23}$$

where $x^*$ is a global minimizer of $f$. Let $f_* = f(x^*)$.

In this case, our goal is to bound the expectation of $N_\epsilon$, the number of iterations taken by
Algorithm 3.1 until

$$f(X^k) - f_* \le \epsilon \tag{24}$$

occurs. We denote $f(X^k) - f_*$ by $\Delta^f_k$ and define $F_k = 1/\Delta^f_k$. Clearly, $N_\epsilon$ is also the number of
iterations taken until $F_k \ge 1/\epsilon = F_\epsilon$ occurs.

Regarding Assumption 2.1, Lemma 3.1 provides the value for the constant $C$, namely, that
whenever $A_k \le C$ with $C = \frac{1-\theta}{0.5L+\kappa}$, then every true iteration is also successful. We now show
that on true and successful iterations, $F_k$ is increased by at least some function value $h(A_k)$ for
all $k < N_\epsilon$.


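Lemma 3.3 Let Assumptions 3.2 and 3.3 hold. Consider any realization of Algorithm 3.1. For
every iteration $k$ that is true and successful, we have

$$f_{k+1} \ge f_k + \frac{\theta\alpha_k}{D^2(1+\kappa\alpha_{\max})^2}.$$

Proof. Note that convexity of $f$ implies that for all $x$ and $y$,

$$f(x) - f(y) \ge \nabla f(y)^T(x - y),$$

and so by using $x = x^*$ and $y = x^k$, we have

$$-\Delta^f_k = f(x^*) - f(x^k) \ge \nabla f(x^k)^T(x^* - x^k) \ge -D\|\nabla f(x^k)\|,$$

where to obtain the last inequality, we used Cauchy-Schwarz inequality and (23). Thus when $k$
is a true iteration, (16) further provides

$$\frac{1}{D}\Delta^f_k \le \|\nabla f(x^k)\| \le (1+\kappa\alpha_{\max})\|g^k\|. \tag{25}$$

When $k$ is also successful,

$$\Delta^f_k - \Delta^f_{k+1} = f(x^k) - f(x^{k+1}) \ge \theta\alpha_k\|g^k\|^2 \ge \frac{\theta\alpha_k}{D^2(1+\kappa\alpha_{\max})^2}(\Delta^f_k)^2.$$

Dividing the above expression by $\Delta^f_k\Delta^f_{k+1}$, we have that on all true and successful iterations

$$\frac{1}{\Delta^f_{k+1}} - \frac{1}{\Delta^f_k} \ge \frac{\theta\alpha_k}{D^2(1+\kappa\alpha_{\max})^2}\,\frac{\Delta^f_k}{\Delta^f_{k+1}} \ge \frac{\theta\alpha_k}{D^2(1+\kappa\alpha_{\max})^2},$$

since $\Delta^f_k \ge \Delta^f_{k+1}$. Recalling the definition of $f_k$ completes the proof.

□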


Similarly to the nonconvex case, we conclude from Lemmas 3.1 and 3.3 that for any realization
of Algorithm 3.1 the following have to happen.

• If $k$ is a true and successful iteration, then

$$f_{k+1} \ge f_k + \frac{\theta\alpha_k}{D^2(1+\kappa\alpha_{\max})^2}$$

and

$$\alpha_{k+1} = \gamma^{-1}\alpha_k.$$

• If $\alpha_k \le C$, where $C$ is defined in (18), and iteration $k$ is true, then it is also successful.

Hence, Assumption 2.1 holds and the process $\{A_k, F_k\}$ behaves exactly as our generic process
(2)-(3) in Section 2.3 with $C$ defined in (18) and the specific choice of $h(A_k) = \frac{\theta A_k}{D^2(1+\kappa\alpha_{\max})^2}$.

Theorem 2.1 can be immediately applied together with the above expressions for $C$, $h(C)$
and $F_\epsilon$, yielding the following complexity bound.

Theorem 3.2 Let Assumptions 3.1, 3.2 and 3.3 hold. Then the expected number of iterations
that Algorithm 3.1 takes until $f(X^k) - f_* \le \epsilon$ occurs is bounded by

$$\mathbb{E}(N_\epsilon) \le \frac{2p}{(2p-1)^2}\left[\frac{M}{\epsilon} + \log_\gamma\left(\frac{1-\theta}{\alpha_0(0.5L+\kappa)}\right)\right],$$

where $M = \frac{2(1+\kappa\alpha_{\max})^2 D^2(0.5L+\kappa)}{\theta(1-\theta)}$, so that $M/\epsilon = 2F_\epsilon/h(C)$, is a constant independent of
$p$ and $\epsilon$.


Remark 3.2 We again note the same dependence on $\epsilon$ in the complexity bound in Theorem 3.2
as in the deterministic convex case and on $p$, as in the nonconvex case.

3.4 The strongly convex case 

We now consider the case of strongly convex objective functions, hence the following assumption 
holds. 


Assumption 3.4 $f \in C^1(\mathbb{R}^n)$ is strongly convex, namely, for all $x$ and $y$ and some $\mu > 0$,

$$f(x) \ge f(y) + \nabla f(y)^T(x - y) + \frac{\mu}{2}\|x - y\|^2.$$


Recall our notation $\Delta^f_k = f(X^k) - f_*$. Our goal here is again, as in the convex case, to bound
the expectation of the number of iterations that occur until $\Delta^f_k \le \epsilon$. In the strongly convex case,
however, this bound is logarithmic in $1/\epsilon$, just as it is in the case of the deterministic algorithm.

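Lemma 3.4 Let Assumption 3.4 hold. Consider any realization of Algorithm 3.1. For every
iteration $k$ that is true and successful, we have

$$f(x^k) - f(x^{k+1}) = \Delta^f_k - \Delta^f_{k+1} \ge \frac{2\mu\theta}{(1+\kappa\alpha_{\max})^2}\,\alpha_k\Delta^f_k, \tag{26}$$

or equivalently,

$$\Delta^f_{k+1} \le \left(1 - \frac{2\mu\theta}{(1+\kappa\alpha_{\max})^2}\,\alpha_k\right)\Delta^f_k. \tag{27}$$

Proof. Assumption 3.4 implies, for $x = x^k$ and $y = x^*$, that [see [15], Th 2.1.10]

$$\Delta^f_k \le \frac{1}{2\mu}\|\nabla f(x^k)\|^2$$

or equivalently,

$$\sqrt{2\mu\Delta^f_k} \le \|\nabla f(x^k)\| \le (1+\kappa\alpha_{\max})\|g^k\|,$$

where in the second inequality we used (16). The bound (26) now follows from the sufficient
decrease condition (14).

□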


Note that from (26) we have that if $\Delta^f_k > 0$ and $\alpha_k > (1+\kappa\alpha_{\max})^2/(2\mu\theta)$ then the iteration
is unsuccessful. Hence, for an iteration to be successful we must have $\alpha_k \le (1+\kappa\alpha_{\max})^2/(2\mu\theta)$.
We also know that a true iteration is successful when $\alpha_k \le C$, where $C$ is defined in (18), assuming
that $C \le (1+\kappa\alpha_{\max})^2/(2\mu\theta)$. To simplify the analysis we will simply assume that this inequality
holds, by an appropriate choice of the parameters, which can be done without loss of generality.
We now define $F_k = \log\frac{1}{\Delta^f_k}$ and $F_\epsilon = \log\frac{1}{\epsilon}$, and the hitting time $N_\epsilon$ is the number of
iterations taken until $\Delta^f_k \le \epsilon$.

As in the convex case, using Lemmas 3.1 and 3.4 we conclude that, for any realization of
Algorithm 3.1 the following have to happen.

• If $k$ is a true and successful iteration, then

$$f_{k+1} \ge f_k - \log\left(1 - \frac{2\mu\theta}{(1+\kappa\alpha_{\max})^2}\,\alpha_k\right)$$

and

$$\alpha_{k+1} = \gamma^{-1}\alpha_k.$$

• If $\alpha_k \le C$, where $C$ is defined in (18), and iteration $k$ is true, then it is also successful.

Hence, again, Assumption 2.1 holds and the process $\{A_k, F_k\}$ behaves exactly as our generic
process (2)-(3) in Section 2.3 with $C$ defined in (18) and the specific choice of

$$h(A_k) = -\log\left(1 - \frac{2\mu\theta}{(1+\kappa\alpha_{\max})^2}\,A_k\right).$$

By using the above expressions for $C$, $h(C)$ and $F_\epsilon$, again as in the convex case, we have the
following complexity bound for the strongly convex case.


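Theorem 3.3 Let Assumptions 3.1, 3.2 and 3.4 hold. Then the expected number of iterations
that Algorithm 3.1 takes until $f(X^k) - f_* \le \epsilon$ occurs is bounded by

$$\mathbb{E}(N_\epsilon) \le \frac{2p}{(2p-1)^2}\left[M\log\left(\frac{1}{\epsilon}\right) + \log_\gamma\left(\frac{1-\theta}{\alpha_0(0.5L+\kappa)}\right)\right],$$

where $M = -2\Big/\log\left(1 - \frac{2\mu\theta(1-\theta)}{(1+\kappa\alpha_{\max})^2(0.5L+\kappa)}\right)$, so that $M\log(1/\epsilon) = 2F_\epsilon/h(C)$, is a
constant independent of $p$ and $\epsilon$.

Remark 3.3 Again, note the same dependence of the complexity bound in Theorem 3.3 on $\epsilon$
as for the deterministic line-search algorithm, and the same dependence on $p$ as for the other
problem classes discussed above.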
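The three complexity results of this section differ only in the term $2F_\epsilon/h(C)$ that is substituted into Theorem 2.1. The following sketch computes that term for each problem class from the expressions for $C$, $h$ and $F_\epsilon$ given above; the function names and the parameter values are illustrative assumptions. It makes the $1/\epsilon^2$, $1/\epsilon$ and $\log(1/\epsilon)$ scalings easy to verify numerically.

```python
import math


def threshold_C(theta, L, kappa):
    """The constant C from (18)."""
    return (1.0 - theta) / (0.5 * L + kappa)


def leading_term_nonconvex(eps, gap, theta, L, kappa, alpha_max):
    """2*F_eps/h(C) with F_eps = gap = f(x^0) - f_* and h(C) = theta*eps^2*C/(1+kappa*alpha_max)^2."""
    C = threshold_C(theta, L, kappa)
    h_C = theta * eps ** 2 * C / (1.0 + kappa * alpha_max) ** 2
    return 2.0 * gap / h_C


def leading_term_convex(eps, D, theta, L, kappa, alpha_max):
    """2*F_eps/h(C) with F_eps = 1/eps and h(C) = theta*C/(D^2*(1+kappa*alpha_max)^2)."""
    C = threshold_C(theta, L, kappa)
    h_C = theta * C / (D ** 2 * (1.0 + kappa * alpha_max) ** 2)
    return 2.0 * (1.0 / eps) / h_C


def leading_term_strongly_convex(eps, mu, theta, L, kappa, alpha_max):
    """2*F_eps/h(C) with F_eps = log(1/eps) and h(C) = -log(1 - 2*mu*theta*C/(1+kappa*alpha_max)^2)."""
    C = threshold_C(theta, L, kappa)
    ratio = 2.0 * mu * theta * C / (1.0 + kappa * alpha_max) ** 2
    if not 0.0 < ratio < 1.0:
        raise ValueError("parameters must give 0 < 2*mu*theta*C/(1+kappa*alpha_max)^2 < 1")
    return 2.0 * math.log(1.0 / eps) / (-math.log(1.0 - ratio))


# Illustrative parameter values only.
params = dict(theta=0.2, L=4.0, kappa=1.0, alpha_max=1.0)
for eps in (1e-1, 1e-2, 1e-3):
    print(eps,
          leading_term_nonconvex(eps, gap=10.0, **params),
          leading_term_convex(eps, D=5.0, **params),
          leading_term_strongly_convex(eps, mu=1.0, **params))
```

Multiplying any of these terms by $2p/(2p-1)^2$ and adding the corresponding multiple of the logarithmic term $\log_\gamma(C/\alpha_0)$ gives the full bound of Theorem 2.1 for that class.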
4 Probabilistic second-order models and cubic regularization
methods 

In this section we consider a randomized version of second-order methods, whose deterministic
counterpart achieves the optimal complexity rate. As in the line-search case, we show that in
expectation, the same rate of convergence applies as in the deterministic (cubic regularization)
case, augmented by a term that depends on the probability of having accurate models. Here we
return to considering general objective functions that are not necessarily convex.

4.1 A cubic regularization algorithm with random models 

Let us now consider a cubic regularization method where the following model

$$m_k(x^k + s) = f(x^k) + s^T g^k + \frac{1}{2}s^T b^k s + \frac{\sigma_k}{3}\|s\|^3, \tag{28}$$

is approximately minimized on each iteration $k$ with respect to $s$, for some vector $g^k$ and a
matrix $b^k$ and some regularization parameter $\sigma_k > 0$. As before we assume that $g^k$ and $b^k$ are
realizations of some random variables $G^k$ and $B^k$, which implies that the model is random, and we
assume that it is sufficiently accurate with probability at least $p$; the details of this assumption
will be given after we state the algorithm.

The step $s^k$ is computed to approximately minimize the model (28), namely, it is
required to satisfy

$$(s^k)^T g^k + (s^k)^T b^k s^k + \sigma_k\|s^k\|^3 = 0 \quad \text{and} \quad (s^k)^T b^k s^k + \sigma_k\|s^k\|^3 \ge 0 \tag{29}$$

and

$$\|\nabla m_k(x^k + s^k)\| \le \kappa_\theta\min\{1, \|s^k\|\}\|g^k\|, \tag{30}$$

where $\kappa_\theta \in (0,1)$ is a user-chosen constant.

Note that (29) is satisfied if $s^k$ is the global minimizer of the model $m_k$ over some subspace; in
fact, it is sufficient for $s^k$ to be the global minimizer of $m_k$ along the line $\alpha s^k$ [8]^1. Condition (30)
is a relative termination condition for the model minimization (say over increasing subspaces)
and it is clearly satisfied at stationary points of the model; ideally it will be satisfied sooner at
least in the early iterations of the algorithm [5].

The probabilistic Adaptive Regularization with Cubics (ARC) framework is presented below. 


Algorithm 4.1 An ARC algorithm with random models


Initialization

Choose parameters $\gamma \in (0,1)$, $\theta \in (0,1)$, $\sigma_{\min} > 0$ and $\kappa_\theta \in (0,1)$. Pick initial $x^0$ and
$\sigma_0 > \sigma_{\min}$. Repeat for $k = 0, 1, \ldots$,

1. Compute a model

Compute an approximate gradient $g^k$ and Hessian $b^k$ and form the model (28).

2. Compute the trial step $s^k$

Compute the trial step $s^k$ to satisfy (29) and (30).

3. Check sufficient decrease

Compute $f(x^k + s^k)$ and

$$\rho_k = \frac{f(x^k) - f(x^k + s^k)}{f(x^k) - m_k(x^k + s^k)}. \tag{31}$$

4. Update the iterate

Set

$$x^{k+1} = \begin{cases} x^k + s^k & \text{if } \rho_k \ge \theta \quad [k \text{ successful}] \\ x^k & \text{otherwise} \quad [k \text{ unsuccessful}] \end{cases} \tag{32}$$

5. Update the regularization parameter

Set

$$\sigma_{k+1} = \begin{cases} \max\{\gamma\sigma_k, \sigma_{\min}\} & \text{if } \rho_k \ge \theta \\ \frac{1}{\gamma}\sigma_k & \text{otherwise.} \end{cases} \tag{33}$$

^1 Note that a recently-proposed cubic regularization variant [5] can dispense with the approximate global
minimization condition altogether while maintaining the optimal complexity bound of ARC. A probabilistic
variant of [5] can be constructed similarly to probabilistic ARC, and our analysis here can be extended to provide
same-order complexity bounds.
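The structure of Algorithm 4.1 can be illustrated in one dimension, where the trial step can be written in closed form: the stationary point of the model along the direction $-\mathrm{sign}(g^k)$ satisfies both (29) and (30), the latter with zero model gradient. The sketch below is a toy under stated assumptions, not the general method: the scalar setting, the closed-form step, the hypothetical oracle `make_unreliable_model` (exact derivatives with probability `p`, a sign-flipped gradient otherwise) and the test function $\log(1+x^2)$ are all choices made here for illustration.

```python
import math
import random


def cubic_model_step(g, b, sigma):
    """Stationary point of m(s) = g*s + 0.5*b*s**2 + (sigma/3)*|s|**3 along -sign(g)."""
    if g == 0.0:
        return 0.0
    root = math.sqrt(b * b + 4.0 * sigma * abs(g))
    # Positive root t of sigma*t**2 + b*t - |g| = 0, written to limit cancellation.
    t = 2.0 * abs(g) / (b + root) if b >= 0.0 else (root - b) / (2.0 * sigma)
    return -math.copysign(t, g)


def arc_random_models(f, model, x0, sigma0, sigma_min, gamma, theta, max_iter):
    """One-dimensional sketch of Algorithm 4.1; model(x) returns the pair (g^k, b^k)."""
    x, sigma = x0, sigma0
    for _ in range(max_iter):
        g, b = model(x)                                      # step 1: random model
        s = cubic_model_step(g, b, sigma)                    # step 2: trial step
        model_decrease = -(g * s + 0.5 * b * s * s + sigma / 3.0 * abs(s) ** 3)
        if model_decrease > 0.0:                             # step 3: ratio (31)
            rho = (f(x) - f(x + s)) / model_decrease
        else:
            rho = -math.inf                                  # zero step: treat as unsuccessful
        if rho >= theta:                                     # steps 4 and 5: (32) and (33)
            x = x + s
            sigma = max(gamma * sigma, sigma_min)
        else:
            sigma = sigma / gamma
    return x, sigma


def make_unreliable_model(grad, hess, p, rng):
    """Hypothetical oracle: exact (gradient, Hessian) with probability p, flipped gradient otherwise."""
    def model(x):
        g, b = grad(x), hess(x)
        return (g, b) if rng.random() < p else (-g, b)
    return model


def f(x):
    return math.log1p(x * x)                  # nonconvex, bounded below, minimizer x = 0


def grad_f(x):
    return 2.0 * x / (1.0 + x * x)


def hess_f(x):
    return (2.0 - 2.0 * x * x) / (1.0 + x * x) ** 2


model = make_unreliable_model(grad_f, hess_f, p=0.8, rng=random.Random(0))
x_final, sigma_final = arc_random_models(f, model, x0=3.0, sigma0=1.0, sigma_min=0.1,
                                         gamma=0.5, theta=0.1, max_iter=200)
print(x_final, abs(grad_f(x_final)))
```

As in the line-search sketch, bad models cost only rejected steps and a temporary increase of $\sigma_k$; the objective value never increases because a step is accepted only when $\rho_k \ge \theta$.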


Remark 4.1 Typically one would further refine (32) and (33) by distinguishing
between successful and very successful iterations, when $\rho_k$ is not just positive but close to 1.
It is beneficial in the deterministic setting to keep the regularization parameter unchanged on
successful iterations when $\rho_k$ is greater than $\theta$ but is not close to 1 and only to decrease it when
$\rho_k$ is substantially larger than $\theta$. For simplicity and uniformity of our general framework, we
simplified the parameter update rule. However, the analysis presented here can be quite easily
extended to the more general case by slightly extending the flexibility of the stochastic processes.
In practice it is yet unclear if the same strategy will be beneficial, as "accidentally" bad models
and the resulting unsuccessful steps may drive the parameter $\sigma_k$ to be larger than it should be,
and hence a more aggressive decrease of $\sigma_k$ may be desired. This practical study is a subject of
future research.


Remark 4.2 We have stated Algorithm 4.1 so that it is as close as possible to known/deterministic
ARC frameworks for ease of reading. We note however, that it is perfectly coherent with the
generic algorithmic framework, Algorithm 2.1, if one sets $\alpha_k = 1/\sigma_k$ and $\rho_k \ge \theta$ as the sufficient
decrease condition. We will exploit this connection in the analysis that follows.


The requirement of sufficient model accuracy considered here is similar to the definition of
probabilistically fully quadratic models introduced in [1], though note that we only require the
second-order condition along the trial step $s^k$.

Definition 4.1 We say that a sequence of random models and corresponding directions $\{M_k, S^k\}$
is (p)-probabilistically "sufficiently accurate" for Algorithm 4.1 if there exist constants $\kappa_g$ and
$\kappa_H$ such that for any corresponding random sequence $\{A_k = 1/\Sigma_k, X^k\}$, the random indicator
variables

$$I_k = \mathbb{1}\{\|\nabla f(X^k) - G^k\| \le \kappa_g\|S^k\|^2 \text{ and } \|(H(X^k) - B^k)S^k\| \le \kappa_H\|S^k\|^2\}$$

satisfy the following submartingale-like condition

$$P(I_k = 1 \mid \mathcal{F}^M_{k-1}) \ge p,$$

where $\mathcal{F}^M_{k-1} = \sigma(M_0, \ldots, M_{k-1})$ is the $\sigma$-algebra generated by $M_0, \ldots, M_{k-1}$.

As before, for any realization of Algorithm 4.1 we refer to iterations $k$ when $I_k = 1$ occurs as
true iterations, and otherwise, as false iterations. Hence for all true iterations $k$,

$$\|\nabla f(x^k) - g^k\| \le \kappa_g\|s^k\|^2 \quad \text{and} \quad \|(H(x^k) - b^k)s^k\| \le \kappa_H\|s^k\|^2. \tag{34}$$

For the remainder of the analysis of Algorithm 4.1 we make the following assumption.

Assumption 4.1 The sequence of random models and corresponding directions $\{M_k, S^k\}$, generated
in Algorithm 4.1, is (p)-probabilistically "sufficiently accurate" for the corresponding random
sequence $\{A_k = 1/\Sigma_k, X^k\}$, with $p > 1/2$.

Regarding the possibly nonconvex objective $f$, in addition to Assumption 3.2 we also need
the following assumption.

Assumption 4.2 $f \in C^2(\mathbb{R}^n)$ and has globally Lipschitz continuous Hessian $H$, namely,

$$\|H(x) - H(y)\| \le L_H\|x - y\| \quad \text{for all } x, y \in \mathbb{R}^n \text{ and some } L_H > 0. \tag{35}$$

4.2 Global convergence rate analysis, nonconvex case 

The next four lemmas give useful properties of Algorithm 4.1 that are needed later for our
stochastic analysis.

Lemma 4.1 (Lemma 3.3 in [7]) Consider any realization of Algorithm 4.1. Then on each
iteration $k$ we have

$$f(x^k) - m_k(x^k + s^k) \ge \frac{1}{6}\sigma_k\|s^k\|^3. \tag{36}$$

Thus on every successful iteration $k$, we have

$$f(x^k) - f(x^{k+1}) \ge \frac{\theta}{6}\sigma_k\|s^k\|^3. \tag{37}$$

Proof. Clearly, (37) follows from (36) and the sufficient decrease condition (31)-(32). It remains
to prove (36). Combining the first condition on step $s^k$ in (29) with the model expression (28)
for $s = s^k$ we can write

$$f(x^k) - m_k(x^k + s^k) = \frac{1}{2}(s^k)^T b^k s^k + \frac{2}{3}\sigma_k\|s^k\|^3.$$

The second condition on $s^k$ in (29) implies $(s^k)^T b^k s^k \ge -\sigma_k\|s^k\|^3$ which, when used with the
above equation, gives (36). □
Lemma 4.2 Let Assumptions 3.2 and 4.2 hold. For any realization of Algorithm 4.1, if iteration
$k$ is true (i.e., $I_k = 1$), and if

$$\sigma_k \ge \sigma_c = \frac{2\kappa_g + \kappa_H + L + L_H}{1 - \frac{1}{3}\theta}, \tag{38}$$

then iteration $k$ is also successful.

Proof. Clearly, if $\rho_k - 1 \ge 0$, then $k$ is successful by definition. Let us consider the case when
$\rho_k < 1$; then if $1 - \rho_k \le 1 - \theta$, $k$ is successful. We have from (31), that

$$1 - \rho_k = \frac{f(x^k + s^k) - m_k(x^k + s^k)}{f(x^k) - m_k(x^k + s^k)}.$$

Taylor expansion and triangle inequalities give, for some $\xi^k \in [x^k, x^k + s^k]$,

$$\begin{aligned}
f(x^k + s^k) - m_k(x^k + s^k) &= [\nabla f(x^k) - g^k]^T s^k + \tfrac{1}{2}(s^k)^T[H(\xi^k) - H(x^k)]s^k + \tfrac{1}{2}(s^k)^T[H(x^k) - b^k]s^k - \tfrac{\sigma_k}{3}\|s^k\|^3 \\
&\le \|\nabla f(x^k) - g^k\|\cdot\|s^k\| + \tfrac{1}{2}\|H(\xi^k) - H(x^k)\|\cdot\|s^k\|^2 + \tfrac{1}{2}\|(H(x^k) - b^k)s^k\|\cdot\|s^k\| - \tfrac{\sigma_k}{3}\|s^k\|^3 \\
&\le (6\kappa_g + 3L_H + 3\kappa_H - 2\sigma_k)\tfrac{1}{6}\|s^k\|^3,
\end{aligned}$$

where the last inequality follows from the fact that the iteration is true and hence (34) holds,
and from Assumption 4.2. This and (36) now give that $1 - \rho_k \le 1 - \theta$ when $\sigma_k$ satisfies (38). □


Note that for the above lemma to hold $\sigma_c$ does not have to depend on $L$. However, in
what follows we will need another condition on $\sigma_c$, which will involve $L$; hence for simplicity of
notation we introduced $\sigma_c$ above to satisfy all necessary bounds.


Lemma 4.3 Let Assumptions 3.2 and 4.2 hold. Consider any realization of Algorithm 4.1. On
each true iteration $k$ we have

$$\|s^k\| \ge \sqrt{\frac{1-\kappa_\theta}{\sigma_k + \kappa_s}\,\|\nabla f(x^k + s^k)\|}, \tag{39}$$

where $\kappa_s = 2\kappa_g + \kappa_H + L + L_H$.

Proof. Triangle inequality, equality $\nabla m_k(x^k + s) = g^k + b^k s + \sigma_k\|s\|s$ and condition (30) on $s^k$
together give

$$\begin{aligned}
\|\nabla f(x^k + s^k)\| &\le \|\nabla f(x^k + s^k) - \nabla m_k(x^k + s^k)\| + \|\nabla m_k(x^k + s^k)\| \\
&\le \|\nabla f(x^k + s^k) - g^k - b^k s^k\| + \sigma_k\|s^k\|^2 + \kappa_\theta\min\{1, \|s^k\|\}\|g^k\|.
\end{aligned} \tag{40}$$

Recalling Taylor expansion of $\nabla f(x^k)$

$$\nabla f(x^k + s^k) = \nabla f(x^k) + \int_0^1 H(x^k + ts^k)s^k\,dt,$$

and applying triangle inequality, again, we have

$$\begin{aligned}
\|\nabla f(x^k + s^k) - g^k - b^k s^k\| &\le \|\nabla f(x^k) - g^k\| + \left\|\int_0^1[H(x^k + ts^k) - H(x^k)]s^k\,dt\right\| + \|H(x^k)s^k - b^k s^k\| \\
&\le \left(\kappa_g + \tfrac{1}{2}L_H + \kappa_H\right)\|s^k\|^2,
\end{aligned}$$

where to get the second inequality, we also used (34) and Assumption 4.2.
We can bound $\|g^k\|$ as follows

$$\|g^k\| \le \|g^k - \nabla f(x^k)\| + \|\nabla f(x^k) - \nabla f(x^k + s^k)\| + \|\nabla f(x^k + s^k)\| \le \kappa_g\|s^k\|^2 + L\|s^k\| + \|\nabla f(x^k + s^k)\|.$$

Thus finally, we can bound all the terms on the right hand side of (40) in terms of $\|s^k\|^2$ and
using the fact that $\kappa_\theta \in (0,1)$ we can write

$$(1 - \kappa_\theta)\|\nabla f(x^k + s^k)\| \le (2\kappa_g + \kappa_H + L + L_H + \sigma_k)\|s^k\|^2,$$

which is equivalent to (39). □


Lemma 4.4 Let Assumptions 3.2 and 4.2 hold. Consider any realization of Algorithm 4.1. On
each true and successful iteration $k$, we have

$$f(x^k) - f(x^{k+1}) \ge \frac{\kappa_f}{(\max\{\sigma_k, \sigma_c\})^{3/2}}\,\|\nabla f(x^{k+1})\|^{3/2}, \tag{41}$$

where $\kappa_f := \frac{\theta}{12\sqrt{2}}(1 - \kappa_\theta)^{3/2}\sigma_{\min}$ and $\sigma_c$ is defined in (38).


Proof. Combining Lemma 4.3, inequality (37) from Lemma 4.1 and the definition of successful
iteration in Algorithm 4.1 we have, for all true and successful iterations $k$,

$$f(x^k) - f(x^{k+1}) \ge \frac{\theta}{6}(1 - \kappa_\theta)^{3/2}\frac{\sigma_k}{(\sigma_k + \kappa_s)^{3/2}}\,\|\nabla f(x^{k+1})\|^{3/2}. \tag{42}$$

Using that $\sigma_k \ge \sigma_{\min}$ and that $\kappa_s \le \sigma_c$, so that $\sigma_k + \kappa_s \le 2\max\{\sigma_k, \sigma_c\}$, implies (41).

□


The stochastic processes and global convergence rate analysis. We are now ready to
cast Algorithm 4.1 and its behavior into the generic stochastic analysis framework of Section 2.
For each realization of Algorithm 4.1 we define

$$\alpha_k = \frac{1}{\sigma_k} \quad \text{and} \quad f_k = f(x^0) - f(x^k),$$

and consider the corresponding stochastic process $\{A_k = 1/\Sigma_k, F_k = f(x^0) - f(X^k)\}$. Let
$F_\epsilon = f(x^0) - f_*$ denote the upper bound on the progress measure $F_k$.

As in the case of the line-search algorithm applied to nonconvex objectives, we would like to
bound the expected number of iterations that Algorithm 4.1 takes until $\|\nabla f(X^k)\| \le \epsilon$ occurs.
Here, however, for technical reasons made clear below, we count the number of iterations until
a successful iteration results in $x^{k+1}$ such that $\|\nabla f(x^{k+1})\| \le \epsilon$. Let $N_\epsilon$ denote the (random)
index of such an iteration. (Clearly, $N_\epsilon$ thus defined is simply one less than the number of
iterations that occur until $\|\nabla f(X^{k+1})\| \le \epsilon$.)

Regarding Assumption 2.1, Lemmas 4.2 and 4.4 provide that the following must hold for
any realization of Algorithm 4.1.

• If $k$ is a true and successful iteration, then

$$f_{k+1} \ge f_k + \frac{\kappa_f}{(\max\{\sigma_k, \sigma_c\})^{3/2}}\,\|\nabla f(x^{k+1})\|^{3/2}$$

and

$$\alpha_{k+1} = \gamma^{-1}\alpha_k.$$

• If $\alpha_k \le C = \frac{1}{\sigma_c}$, where $\sigma_c$ is defined in (38), and iteration $k$ is true, then it is also successful.

Hence, once again, Assumption 2.1 holds and the process $\{A_k, F_k\}$ behaves exactly as our generic
process (2)-(3) in Section 2.3 with $C = \frac{1}{\sigma_c} = \frac{1 - \frac{1}{3}\theta}{2\kappa_g + \kappa_H + L + L_H}$, and the specific choice

$$h(A_k) = \kappa_f(\min\{A_k, C\})^{3/2}\epsilon^{3/2},$$

for all $k < N_\epsilon$.

Finally, the complexity result again follows from Theorem 2.1 and the expressions for $C$,
$h(C)$ and $F_\epsilon$.

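Theorem 4.1 Let Assumptions 3.2, 4.1 and 4.2 hold. Then the expected number of iterations
that Algorithm 4.1 takes until $\|\nabla f(X^{k+1})\| \le \epsilon$ occurs is bounded by

$$\mathbb{E}(N_\epsilon) \le \frac{2p}{(2p-1)^2}\left[\frac{M}{\epsilon^{3/2}} + \log_\gamma\left(\frac{\sigma_0(1 - \frac{1}{3}\theta)}{2\kappa_g + \kappa_H + L + L_H}\right)\right],$$

where $M = \frac{2(f(x^0) - f_*)(2\kappa_g + \kappa_H + L + L_H)^{3/2}}{\kappa_f(1 - \frac{1}{3}\theta)^{3/2}}$, so that $M/\epsilon^{3/2} = 2F_\epsilon/h(C)$ and the logarithmic
term equals $\log_\gamma(C/\alpha_0)$ with $\alpha_0 = 1/\sigma_0$, is a constant independent of $p$ and $\epsilon$.


Remark 4.3 We note that the dependency on $\epsilon$ in the above bound on the expected number
of iterations is of the order $\epsilon^{-3/2}$, which is of the same order as for the deterministic ARC
algorithm and is the optimal rate for nonconvex optimization using second-order models.
The dependence on $p$ is, again, the same as in the case of line-search and it is intuitive.


Remark 4.4 Theorem ?? stating that $\liminf_{k\to\infty}\|\nabla f(X^k)\| = 0$ almost surely, holds for
Algorithm 4.1 since a similar proof applies.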


5 Random models 

In this section we will discuss and motivate the definition of probabilistically "sufficiently
accurate" models. In particular, Definition 3.1 is a modification of the definition of probabilistically
fully-linear models, which is used in [1]. Similarly, Definition 4.1 is similar to that of
probabilistically fully-quadratic models in [1]. These definitions serve to provide properties of the model
(with some probability) which are sufficient for first-order (in the case of Definition 3.1) and
second-order (in the case of Definition 4.1) convergence rates.

We will now describe several settings where the models are random and satisfy our definitions.
5.1 Stochastic gradients and batch sampling

In [4] an adaptive sample size strategy was proposed in the setting where $\nabla f(x) = \frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x)$,
for large values of $N$. In this case computing $\nabla f(x)$ accurately can be prohibitive, hence, instead
an estimate $\nabla f_S(x) = \frac{1}{|S|}\sum_{i\in S}\nabla f_i(x)$ is often computed in hopes that it provides a good estimate
of the gradient and a descent direction. It is observed in [1] that if sample sets $S_k$ on each
iteration ensure that

$$\|\nabla f_{S_k}(x^k) - \nabla f(x^k)\| \le \mu\|\nabla f_{S_k}(x^k)\| \tag{43}$$

for some $\mu \in (0,1)$, then using a fixed step size

Ofc = a < (44) 


the step $s^k = -\alpha\nabla f_{S_k}(x^k)$ is always a descent step and the line search algorithm converges with
the rate $O(\log(1/\epsilon))$ if $f$ is strongly convex. Clearly, condition (43) implies that the model
$M_k(x) = f(x^k) + \nabla f_{S_k}(x^k)^T(x - x^k)$ is sufficiently accurate according to Definition 3.1 for the
given fixed step size $\alpha$. Hence Assumption 3.1 on the models can be viewed as a relaxed version
of those in [4], since we allow the condition (43) to fail, as long as it fails with probability less
than 1/2, conditioned on the past. Moreover, we analyze the practical version of line search
algorithm, with a variable step size, which does not have to remain smaller than the bound in (44),
and we provide convergence rates in convex, strongly convex and nonconvex setting.

Convergence in expectation of a stochastic algorithm is further shown in [4]. In particular,
under the assumption that the variance of $\|\nabla f_i(x)\|$ is bounded for all $i$ and that
$\mathbb{E}_S[\nabla f_S(x^k)] = \nabla f(x^k)$, it is shown that, for $x^k$ computed after $k$ steps of stochastic gradient
descent with a fixed step size, $\mathbb{E}[f(X^k)]$ converges linearly to $f_*$, when $f(x)$ is strongly convex and
if $|S_k|$, the size of the sample set $S_k$, grows exponentially with $k$.

Here, again, our results can be viewed as a generalization of the results in [4]. Indeed, let us
assume that $\mathbb{E}_S[\nabla f_S(x^k)] = \nabla f(x^k)$ for each $x^k$ and let $t_k = |S_k|$ be the size of the sample set $S_k$.
Since the variance of $\|\nabla f_i(x)\|$ is bounded for all $i$, we have that
$\mathbb{E}_{S_k}[\|\nabla f_{S_k}(x^k) - \nabla f(x^k)\|^2] \le w/t_k$ for some fixed $w$, where the expectation is taken over all
random sample sets $S_k$ of size $t_k$. In other words, the variance of one sample of the stochastic
gradient $\|\nabla f_i(x)\|$ is bounded and hence the variance of $\nabla f_{S_k}(x^k)$ decays as the size of $S_k$ increases.

By Chebyshev's inequality

$$P\left\{\|\nabla f_{S_k}(x^k) - \nabla f(x^k)\| > \min\{1/2, \alpha_k\}\|\nabla f(x^k)\|\right\} \le \frac{w}{\min\{1/2, \alpha_k\}^2\|\nabla f(x^k)\|^2|S_k|}.$$


If $\|\nabla f_{S_k}(x^k) - \nabla f(x^k)\| \le \min\{1/2, \alpha_k\}\|\nabla f(x^k)\|$, for a particular $x^k$ and a sample set $S_k$, then
by applying triangle inequality we have

l|V/5,(x^) - V/(x^)|| < 

Hence the probability of the event ||V/ 5 (a:*') — V/(a:^)|| < ig ^t least 

1 - — - >1 - — - 

min{l/2,afc}2||V/(a:^)||2|S'fc| mm{l/2,ak}'^{l + ak)^\\V fs^ix’^WlSkl' 

hence as long as $|S_k|$ is chosen sufficiently large, then this probability is greater than 1/2 and
$\nabla f_S(x^k)$ provides us with a probabilistically sufficiently accurate model according to Definition
3.1. Hence the theory described in this paper applies to the case of line search based on stochastic
gradient. Note that, on top of the results in [1], we not only analyze line search in nonconvex and
convex setting, but also show the bound on the expected number of iterations until the desired
accuracy is reached, rather than the expected accuracy after a given number of iterations. As
we have shown earlier, this implies liminf-type convergence with probability one. Moreover,
as mentioned on page 13, it is not difficult to extend our analysis to show that the number of
iterations until the desired accuracy is reached has exponentially decaying tails.
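The Chebyshev bound above translates directly into a sufficient sample size: the failure probability is at most $1 - p$ as soon as $|S_k| \ge w / \big((1-p)\min\{1/2, \alpha_k\}^2\|\nabla f(x^k)\|^2\big)$. The sketch below evaluates this expression. It is only illustrative: the function names are hypothetical, and in practice neither the variance constant $w$ nor the true gradient norm is known, so both would have to be estimated.

```python
import math


def chebyshev_failure_bound(w, alpha, grad_norm, sample_size):
    """Upper bound w / (min{1/2, alpha}^2 * ||grad f||^2 * |S_k|) on the failure probability."""
    a = min(0.5, alpha)
    return w / (a * a * grad_norm ** 2 * sample_size)


def chebyshev_sample_size(w, alpha, grad_norm, p):
    """Smallest integer |S_k| for which the bound above is at most 1 - p (up to rounding)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie in (0, 1)")
    if min(w, alpha, grad_norm) <= 0.0:
        raise ValueError("w, alpha and grad_norm must be positive")
    a = min(0.5, alpha)
    return math.ceil(w / ((1.0 - p) * a * a * grad_norm ** 2))


# Illustrative values: the required sample size grows as the gradient norm shrinks.
for grad_norm in (2.0, 0.2, 0.02):
    print(grad_norm, chebyshev_sample_size(w=4.0, alpha=1.0, grad_norm=grad_norm, p=0.75))
```

The $1/\|\nabla f(x^k)\|^2$ growth of this estimate is one reason why a comparison in terms of the total number of gradient samples requires a specific sample size selection strategy.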

Analyzing the complexity of methods in this setting in terms of the total number of gradient 
samples is a subject of current research. We leave the exact comparison between 
our results and those in the current literature as future research, as 
this requires defining a sample size selection strategy and possibly improving our results. 
Similarly, we leave for future research the derivation of models in this setting that satisfy 
Definition 4.1 for use within the ARC algorithm. 

5.2 Models based on random sampling of function values 

The motivation behind the notions of probabilistically fully-linear and fully-quadratic models 
introduced in [1] is based on derivative-free models, which are models based on function values 
rather than gradient estimates. We will now show how such models fit into our framework. 

Let us first recall the definition of probabilistically fully-linear and fully-quadratic models and 
pose it in the terms closest to the ones used in this paper. 

Definition 5.1 

1. We say that a sequence of random models $\{M_k\}$ is $(p)$-probabilistically 
fully-linear if there exists a constant $\kappa_g$ such that for any corresponding random sequence 
$\Delta_k$, $X^k$, the random indicator variables 

$$I_k^l = \mathbb{1}\{\|\nabla f(X^k) - G^k\| \le \kappa_g\Delta_k\}$$

satisfy the following submartingale-like condition 

$$P(I_k^l = 1 \mid F^M_{k-1}) \ge p,$$

where $F^M_{k-1} = \sigma(M_0, \ldots, M_{k-1})$ is the $\sigma$-algebra generated by $M_0, \ldots, M_{k-1}$. 

2. We say that a sequence $\{M_k\}$ is $(p)$-probabilistically fully-quadratic if there exist constants $\kappa_g$ 
and $\kappa_H$ such that for any corresponding random sequence $\Delta_k$, $X^k$, the random indicator 
variables 

$$I_k^q = \mathbb{1}\{\|\nabla f(X^k) - G^k\| \le \kappa_g\Delta_k^2 \ \text{and}\ \|H(X^k) - B^k\| \le \kappa_H\Delta_k\}$$

satisfy the following submartingale-like condition 

$$P(I_k^q = 1 \mid F^M_{k-1}) \ge p,$$

where $F^M_{k-1} = \sigma(M_0, \ldots, M_{k-1})$ is the $\sigma$-algebra generated by $M_0, \ldots, M_{k-1}$. 

The key difference between the conditions in Definition 5.1 and those in Definitions 3.1 and 
4.1 is the right-hand side of the error bounds. In the case of fully-linear and fully-quadratic 
models, $\Delta_k$ is a random variable that does not depend on $M_k$, whereas in this paper $\Delta_k$ 
is replaced by $\mathcal{A}_k\|G^k\|$ in the case of Definition 3.1 and by $\|S^k\|$ in the case of Definition 4.1. 

In other words, the accuracy of the model has to be proportional to the step size which this 
model produces. Since in [1] trust region methods are analyzed instead of line search and ARC, 
Definition 5.1 is sufficient there. 
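
To make the contrast concrete, the two accuracy tests can be written side by side for a single realization. This is an illustrative sketch only: the helper names are introduced here, and the second test assumes that the bound in Definition 3.1 (not reproduced in this section) has right-hand side $\kappa\,\alpha_k\|g^k\|$.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def is_fully_linear(grad_f, g, kappa_g, delta):
    """Fully-linear test: the error is bounded by a radius chosen before the model is built."""
    err = norm([a - b for a, b in zip(grad_f, g)])
    return err <= kappa_g * delta


def is_sufficiently_accurate(grad_f, g, kappa, alpha):
    """Step-proportional test (assumed form): the bound scales with the step length alpha*||g||."""
    err = norm([a - b for a, b in zip(grad_f, g)])
    return err <= kappa * alpha * norm(g)
```

In the first test the right-hand side is fixed before the model is seen; in the second it depends on the model gradient itself.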

Models in [1] are constructed by sampling function values in a ball of a given radius around the 
current iterate, and in all cases the construction of the $k$-th model relies on knowledge of 
the sampling radius $\Delta_k$. We will now show that, given a mechanism for constructing probabilistically 
fully-linear and fully-quadratic models for any sequence of radii, we can 
modify our line search algorithm and ARC algorithm, respectively, and extend the convergence 
rate analysis to utilize these models. 

Line-search with probabilistically fully-linear models. Let us consider Algorithm 3.1 
and the corresponding random sequences of iterates $X^k$ and step sizes $\mathcal{A}_k$. If a given model $M_k$ is 
fully-linear in $B(X^k, \mathcal{A}_k\xi_k)$ and $\|G^k\| \ge \kappa_\Delta\xi_k$ for some positive constant $\kappa_\Delta$, then the model $M_k$ 
is sufficiently accurate according to Definition 3.1. 

To achieve this, for instance in the nonconvex case, for all $\|\nabla f(X^k)\| \ge \varepsilon$ consider 

$$\xi_k \le \frac{\varepsilon}{2\kappa_\Delta\max\{\kappa_g\alpha_{\max}, 1\}},$$

where $\kappa_g$ is the constant in the definition of fully-linear models. Then any fully-linear 
model $M_k(x)$ is also sufficiently accurate, simply because $\|\nabla f(X^k) - G^k\| \le \kappa_g\mathcal{A}_k\xi_k \le \varepsilon/2$ 
(taking $\kappa_\Delta \ge 1$) implies $\|G^k\| \ge \varepsilon/2 \ge \kappa_\Delta\xi_k$. Similar bounds can be derived for the convex and strongly 
convex cases. 

Consider the following example of a method that produces probabilistically sufficiently accurate 
models, based on the arguments above. Suppose we are estimating gradients of $f(x)$ 
by a finite difference scheme using step size $\mathcal{A}_k\xi_k$, with $\xi_k$ sufficiently small, and suppose we 
compute the function values using parallel computations. If some of the computations fail to 
complete (due to an overloaded processor, say) with some probability, and the total probability 
of having a computational failure in any of the processors at each iteration is less than 1/2, 
conditioned on the past, then we obtain probabilistically sufficiently accurate models. Note that 
we make no assumption about the nature of the computational error when such an error occurs, hence allowing 
the gradient estimate to be, occasionally, completely inaccurate. 
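
A small simulation can illustrate this example. The sketch below is only one possible reading: forward differences, the function names, and the failure model (a failed evaluation returns an arbitrary number) are all assumptions made here, and the step `radius` plays the role of $\mathcal{A}_k\xi_k$.

```python
import random


def fd_gradient(f, x, radius, fail_prob=0.0, rng=None):
    """Forward-difference gradient estimate of f at x with step `radius`.

    Each of the len(x) + 1 function evaluations independently fails with
    probability `fail_prob`; a failed evaluation returns an arbitrary value
    (hypothetical failure model), so the estimate may be completely wrong.
    """
    if rng is None:
        rng = random.Random(0)

    def evaluate(point):
        if rng.random() < fail_prob:
            return rng.uniform(-1e6, 1e6)  # arbitrary garbage from a failed worker
        return f(point)

    f0 = evaluate(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += radius
        grad.append((evaluate(shifted) - f0) / radius)
    return grad


def total_failure_probability(fail_prob, n_evals):
    """Probability that at least one of n_evals independent evaluations fails."""
    return 1.0 - (1.0 - fail_prob) ** n_evals
```

Under this independence assumption, the requirement in the text corresponds to `total_failure_probability(fail_prob, len(x) + 1) < 0.5`.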

Another example can be derived from [1], where it is shown that sparse gradient and Hessian 
estimates can be obtained by randomly sampling fewer function values than are needed to 
construct the gradient and/or Hessian by finite differences. Using this sampling strategy, 
probabilistically fully-linear and fully-quadratic models can be generated at reduced computational 
cost. Here again, choosing the sampling radius $\Delta_k = \mathcal{A}_k\xi_k$ with sufficiently small $\xi_k$ will 
guarantee that the models are also probabilistically sufficiently accurate. 

We now address a more practical approach, in which the estimates $\xi_k$ are not chosen to be small 
enough a priori, but are dynamically decreased, as another parameter in the algorithm. We 
will outline how our theory can be extended in this case. Consider the following modification of 
Algorithm 3.1. 

Algorithm 5.1 Line-search with probabilistically fully-linear models 

Initialization 

Choose constants $\theta \in (0,1)$, $\gamma \in (0,1)$, $\alpha_{\max} > 0$ and $\kappa_\Delta > 1$. Pick initial $x^0$ and 
$\alpha_0 < \alpha_{\max}$, $\xi_0$. Repeat for $k = 0, 1, \ldots$ 

1. Compute a model 

Compute a model $m_k$ which is probabilistically fully-linear in $B(x^k, \alpha_k\xi_k)$ and use it to 
generate a direction $g^k$. 

2. Check model accuracy 

If $\|g^k\| \ge \kappa_\Delta\xi_k$, then set the step $s^k = -\alpha_k g^k$ and continue to Step 3. 

Otherwise, set $x^{k+1} = x^k$, $\alpha_{k+1} = \alpha_k$, $\xi_{k+1} = \xi_k/\kappa_\Delta$, and return to Step 1. 

3. Check sufficient decrease 

Check if 

$$f(x^k - \alpha_k g^k) \le f(x^k) - \alpha_k\theta\|g^k\|^2. \quad (45)$$

4. Successful step 

If (45) holds, then $x^{k+1} := x^k - \alpha_k g^k$ and $\alpha_{k+1} = \min\{\alpha_k/\gamma, \alpha_{\max}\}$. 

5. Unsuccessful step 

Otherwise, $x^{k+1} = x^k$ and $\alpha_{k+1} = \gamma\alpha_k$. 
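
The steps of Algorithm 5.1 can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation: `model_gradient(x, radius)` is a hypothetical user-supplied oracle that returns the gradient of a model built in the ball of the given radius, the sufficient decrease test is assumed to be the standard Armijo-type condition $f(x^k - \alpha_k g^k) \le f(x^k) - \alpha_k\theta\|g^k\|^2$, and the stopping test on $\|g^k\|$ is a practical addition that is not part of the algorithm.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def line_search_fully_linear(f, model_gradient, x0, alpha0=1.0, xi0=1.0,
                             theta=0.1, gamma=0.5, alpha_max=10.0,
                             kappa_delta=2.0, eps=1e-6, max_iter=10_000):
    """Sketch of a line search with an adaptive accuracy parameter xi_k.

    Returns the final iterate and the number of iterations performed.
    """
    x, alpha, xi = list(x0), alpha0, xi0
    for k in range(max_iter):
        # Step 1: build a model in B(x^k, alpha_k * xi_k) and read off g^k.
        g = model_gradient(x, alpha * xi)
        g_norm = norm(g)
        if g_norm <= eps:  # practical stopping test (assumption of this sketch)
            return x, k
        # Step 2: if the direction is too short relative to xi_k, shrink xi_k
        # and rebuild the model; the iterate and step size stay unchanged.
        if g_norm < kappa_delta * xi:
            xi /= kappa_delta
            continue
        # Step 3: sufficient decrease test.
        trial = [xc - alpha * gc for xc, gc in zip(x, g)]
        if f(trial) <= f(x) - alpha * theta * g_norm ** 2:
            # Step 4: successful step, accept and enlarge the step size.
            x = trial
            alpha = min(alpha / gamma, alpha_max)
        else:
            # Step 5: unsuccessful step, shrink the step size.
            alpha *= gamma
    return x, max_iter
```

With an exact gradient oracle the sketch reduces to backtracking gradient descent; a noisy or occasionally failing oracle can be passed in the same way to experiment with probabilistic models.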


In the above algorithm, at each iteration we maintain $\xi_k$, which is expected to be an 
underestimate of the norm of the descent direction, up to a constant $\kappa_\Delta$. The algorithm then 
uses $\delta_k = \alpha_k\xi_k$ as the radius for constructing fully-linear models. After the model is produced, 
the condition $\|g^k\| \ge \kappa_\Delta\xi_k$ is checked. If this condition holds, the algorithm proceeds exactly as the 
original version; if it fails, then $\xi_k$ is reduced by a constant factor ($\kappa_\Delta$ is a practical 
choice, but any other constant can be used), the iteration is declared unsuccessful 
(hence $x^{k+1} = x^k$), and the step size $\alpha_k$ remains the same. 

Let us consider the different possible outcomes for each iteration $k$ for which $\|\nabla f(x^k)\| \ge \varepsilon$. 
From our analysis above, we know that if $\xi_k \le \frac{\varepsilon}{2\kappa_\Delta\max\{\kappa_g\alpha_{\max}, 1\}}$ and the model is fully linear, then 
$\|g^k\| \ge \kappa_\Delta\xi_k$, hence the model is also sufficiently accurate and the iteration of Algorithm 5.1 
proceeds as in Algorithm 3.1. Since $\xi_k$ is never increased, once it is small enough the 
analysis of Algorithm 5.1 reduces to that of Algorithm 3.1. What remains is to 
estimate the number of iterations that Algorithm 5.1 takes until $\xi_k \le \frac{\varepsilon}{2\kappa_\Delta\max\{\kappa_g\alpha_{\max}, 1\}}$ or $\|\nabla f(x^k)\| \le \varepsilon$ 
occurs. 

While $\xi_k$ is not sufficiently small, we can have the following outcomes: 1) $\|g^k\| < \kappa_\Delta\xi_k$, in 
which case $\xi_k$ is reduced; 2) the model is not fully linear and $\|g^k\| \ge \kappa_\Delta\xi_k$, hence the model 
may not be sufficiently accurate, but $\xi_k$ is not reduced; and 3) the model is fully linear and 
$\|g^k\| \ge \kappa_\Delta\xi_k$, hence the model is also sufficiently accurate. Hence with probability at least $p$, 
$\xi_k$ is reduced or the model is sufficiently accurate. It is possible to extend the definition of our 
stochastic processes and their analysis to compute upper bounds on the expected number 
of iterations Algorithm 5.1 takes until $\|\nabla f(x^k)\| \le \varepsilon$ occurs. This bound will be increased by 
adding a constant times the number of iterations it takes to achieve $\xi_k \le \frac{\varepsilon}{2\kappa_\Delta\max\{\kappa_g\alpha_{\max}, 1\}}$, which is 
$O(\log(1/\varepsilon))$. Again, a similar analysis can be carried out for the cases of convex and strongly 
convex functions. 
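
The logarithmic count follows from a one-line calculation. In the sketch below, $\bar\xi = c\,\varepsilon$ denotes the target threshold for $\xi_k$; the symbols $\bar\xi$ and $c$ are shorthand introduced here, not notation from the text. Each failed accuracy check divides $\xi_k$ by $\kappa_\Delta$, so after $n$ such reductions:

```latex
\xi_0\,\kappa_\Delta^{-n} \le \bar\xi
\iff
n \ge \frac{\log(\xi_0/\bar\xi)}{\log \kappa_\Delta}
  = \frac{\log\bigl(\xi_0/(c\,\varepsilon)\bigr)}{\log \kappa_\Delta}
  = O\bigl(\log(1/\varepsilon)\bigr).
```

This counts only the iterations on which $\xi_k$ is reduced; the remaining iterations are covered by the analysis of Algorithm 3.1.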

ARC with probabilistically fully-quadratic models. Let us consider Algorithm 4.1. 
In this case, in the same vein as for line-search, we consider setting $\Delta_k$ in Definition 5.1 of 
probabilistically fully-quadratic models to a sufficiently small value, or adjusting it during the run 
of the algorithm, so as to ensure that when the model is fully-quadratic, it is also sufficiently 
accurate (at least asymptotically). We will make these two approaches to the choice of $\Delta_k$ 
more precise in what follows. To this end, we need a new variant of Lemma 4.3 for the case of 
probabilistically fully-quadratic models. 


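Lemma 5.1 Let the assumptions of Lemma 4.3 hold. Consider any realization of Algorithm 4.1 where 
we generate models that are $p$-probabilistically fully-quadratic according to Definition 5.1, with $\delta_k$ 
denoting the realization of $\Delta_k$. Then 
on each iteration $k$ in which $I_k^q = 1$, we have 

$$(1 - \kappa_\theta)\|\nabla f(x^k + s^k)\| \le (2\kappa_g + \kappa_H)\,\delta_k\max\{\delta_k, 1\} + (L + L_H + \sigma_k)\|s^k\|^2. \quad (46)$$

In particular, if $\varepsilon \in (0,1]$, $\max\{L, L_H\} \ge 1$, and 

$$\delta_k \le \frac{(1 - \kappa_\theta)\varepsilon}{\max\{2(2\kappa_g + \kappa_H),\ L + L_H + \sigma_k\}}, \quad (47)$$

then on each iteration $k$ with $\|\nabla f(x^k + s^k)\| \ge \varepsilon$ and in which $I_k^q = 1$, we have $\|s^k\| \ge \delta_k$. 

Assume now that $\delta_k = \xi_k/\sigma_k$. Then if $\varepsilon \in (0,1]$, $\max\{L, L_H\} \ge 1$, and 

$$\xi_k \le \frac{(1 - \kappa_\theta)\,\varepsilon\,\sigma_{\min}}{\max\{2(2\kappa_g + \kappa_H),\ L + L_H + \sigma_{\min}\}} := \xi_\varepsilon, \quad (48)$$

then on each iteration $k$ with $\|\nabla f(x^k + s^k)\| \ge \varepsilon$ and in which $I_k^q = 1$, we have $\|s^k\| \ge \delta_k$. 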


Proof. It follows from Definition 5.1 that on each realization of Algorithm 4.1 with $I_k^q = 1$ we have 

$$\|\nabla f(x^k) - g^k\| \le \kappa_g\delta_k^2 \quad \text{and} \quad \|H(x^k) - b^k\| \le \kappa_H\delta_k. \quad (49)$$

The proof of (46) now follows identically to the proof of Lemma 4.3 if one uses (49) in place of 
the corresponding accuracy bounds used there. 

The choice of $\delta_k$ in (47) implies $\delta_k \le 1$, and so $\|s^k\| \ge \delta_k$ trivially holds when $\|s^k\| \ge 1$. 
When $\|s^k\| < 1$, $\|\nabla f(x^k + s^k)\| \ge \varepsilon$, and $\delta_k \le 1$, (46) implies 

$$(1 - \kappa_\theta)\varepsilon \le (2\kappa_g + \kappa_H)\delta_k + (L + L_H + \sigma_k)\|s^k\|.$$

Now the condition (47) on $\delta_k$ implies $(L + L_H + \sigma_k)\|s^k\| \ge (1 - \kappa_\theta)\varepsilon/[2(2\kappa_g + \kappa_H)]$. Applying 
again the upper bound on $\delta_k$ provides $\|s^k\| \ge \delta_k$. 

Finally, if $\delta_k = \xi_k/\sigma_k$, then using $\sigma_k \ge \sigma_{\min}$ for all $k$ due to the algorithm construction, (48) 
implies (47). □ 


The second part of Lemma 5.1 provides that if $p$-probabilistically fully-quadratic models 
are generated with $\delta_k$ chosen sufficiently small so that (47) holds, then the models are also 
$p$-probabilistically sufficiently accurate. Thus Algorithm 4.1 can be run with models sampled in 
this way and the analysis carries through as before. For example, as in the case of line-search, 
$g^k$ and $b^k$ could be generated by (sufficiently accurate) finite-difference schemes using function 
values, where computations are done in parallel and where the total probability of computational 
failure in any of the processors at each iteration is less than 1/2. 

Note however, that the bound that dictates the choice of a suitably small $\delta_k$ depends on 
problem constants that may not be known a priori. Thus it would be better, and computationally 
more efficient, to adjust $\delta_k$ during the run of Algorithm 4.1. A modification of Algorithm 
4.1 that allows this is given next, and can be viewed as the analogue for ARC of the line-search 
Algorithm 5.1. 

Algorithm 5.2 ARC with probabilistically fully-quadratic models 


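Initialization 

Choose parameters $\sigma_{\min} > 0$, $\gamma \in (0,1)$, $\theta \in (0,1)$, $0 < \kappa_\theta < 1$ and $\kappa_\Delta > 1$. Pick a starting 
point $x^0$, a starting value $\sigma_0 \ge \sigma_{\min}$ and $\xi_0 > 0$. Repeat for $k = 0, 1, \ldots$ 

1. Compute a model 

Compute a model $m_k$ which is probabilistically fully-quadratic in $B(x^k, \xi_k/\sigma_k)$, and hence 
generate an approximate gradient $g^k$ and Hessian $b^k$. 

2. Compute the trial step $s^k$ 

Compute the trial step $s^k$ to satisfy the same step conditions as in Algorithm 4.1. 

3. Check model accuracy 

If $\|s^k\| \ge \kappa_\Delta\delta_k := \kappa_\Delta\xi_k/\sigma_k$, then go to Step 4. 

Otherwise, set $x^{k+1} = x^k$, $\sigma_{k+1} = \sigma_k$, $\xi_{k+1} = \xi_k/\kappa_\Delta$, and return to Step 1. 

4. Check sufficient decrease 

Compute $f(x^k + s^k)$ and 

$$\rho_k = \frac{f(x^k) - f(x^k + s^k)}{f(x^k) - m_k(x^k + s^k)}.$$

5. Update the iterate 

Set 

$$x^{k+1} = \begin{cases} x^k + s^k & \text{if } \rho_k \ge \theta \quad [k \text{ successful}] \\ x^k & \text{otherwise} \quad [k \text{ unsuccessful}] \end{cases}$$

6. Update the regularization parameter $\sigma_k$ 

Set 

$$\sigma_{k+1} = \begin{cases} \max\{\gamma\sigma_k, \sigma_{\min}\} & \text{if } \rho_k \ge \theta \\ \frac{1}{\gamma}\sigma_k & \text{otherwise.} \end{cases}$$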
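
For intuition, the loop of Algorithm 5.2 can be sketched in one dimension, where the cubic model $g s + \tfrac{1}{2} b s^2 + \tfrac{1}{3}\sigma|s|^3$ has a closed-form minimizer. This is a simplified illustration under stated assumptions, not the authors' implementation: `model(x, radius)` is a hypothetical oracle returning the model gradient and Hessian, the exact model minimizer is assumed to satisfy the step conditions of Algorithm 4.1, the sampling radius is taken to be $\xi_k/\sigma_k$, and the stopping test on $|g^k|$ is a practical addition.

```python
import math


def arc_1d(f, model, x0, sigma0=1.0, xi0=1.0, sigma_min=1e-3, gamma=0.5,
           theta=0.1, kappa_delta=2.0, eps=1e-6, max_iter=10_000):
    """One-dimensional sketch of ARC with an adaptive accuracy parameter xi_k.

    Returns the final iterate and the number of iterations performed.
    """
    x, sigma, xi = x0, sigma0, xi0
    for k in range(max_iter):
        delta = xi / sigma
        g, b = model(x, delta)                      # Step 1: model in B(x, delta)
        if abs(g) <= eps:                           # practical stopping test
            return x, k
        # Step 2: minimizer of g*s + b*s^2/2 + sigma*|s|^3/3; its length t is the
        # positive root of sigma*t^2 + b*t - |g| = 0, computed in a stable form.
        disc = math.sqrt(b * b + 4.0 * sigma * abs(g))
        t = (disc - b) / (2.0 * sigma) if b < 0 else 2.0 * abs(g) / (b + disc)
        s = -math.copysign(t, g)
        # Step 3: if the step is too short relative to delta, shrink xi and retry.
        if abs(s) < kappa_delta * delta:
            xi /= kappa_delta
            continue
        # Step 4: ratio of actual decrease to decrease predicted by the model.
        predicted = -(g * s + 0.5 * b * s * s + sigma * abs(s) ** 3 / 3.0)
        rho = (f(x) - f(x + s)) / predicted
        # Steps 5 and 6: update the iterate and the regularization parameter.
        if rho >= theta:
            x += s
            sigma = max(gamma * sigma, sigma_min)
        else:
            sigma /= gamma
    return x, max_iter
```

In higher dimensions the step computation would require an approximate cubic subproblem solver, which this sketch does not attempt.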


Algorithm 5.2 updates $\xi_k$ in order to obtain an underestimate $\delta_k := \xi_k/\sigma_k$ of the length of 
the step $s^k$. It constructs probabilistically fully-quadratic models in $B(x^k, \delta_k)$ and checks 
whether $\|s^k\| \ge \kappa_\Delta\delta_k$. If that is the case, then the iteration proceeds as 
in Algorithm 4.1; note that then, if the model is fully quadratic, it is also sufficiently 
accurate. Otherwise, if the step is too short, then $\xi_k$ is decreased by the factor $\kappa_\Delta$, $x^k$ and $\sigma_k$ remain 
unchanged, and a new model is generated (within the smaller ball). 

Let us consider the behavior of Algorithm 5.2 while $\|\nabla f(x^k + s^k)\| \ge \varepsilon$. It follows from 
the last part of Lemma 5.1 that, since $\xi_\varepsilon$ is independent of $k$ and $\xi_k$ is never increased in the 
algorithm, if $\kappa_\Delta\xi_j \le \xi_\varepsilon$ for some $j$, then $\kappa_\Delta\xi_k$ will remain below this threshold for all subsequent 
iterations $k \ge j$; from this $j$ onwards, whenever the model is fully quadratic, $\|s^k\| \ge \kappa_\Delta\delta_k$ 
and the model is also sufficiently accurate. Thus from iteration $j$ onwards, Algorithm 5.2 reduces 
to Algorithm 4.1 and the complexity analysis is the same as before. It remains to estimate the 
size of $j$, namely, the number of iterations Algorithm 5.2 takes until $\kappa_\Delta\xi_k \le \xi_\varepsilon$ or $\|\nabla f(x^k + s^k)\| < \varepsilon$. 

Similarly to the line-search analysis of possible outcomes above, we can argue that while $\xi_k$ 
is not sufficiently small, at least with probability $p$, $\xi_k$ is reduced or the model is sufficiently 
accurate. Thus, extending our earlier ARC analysis (and definitions of stochastic processes, etc.) 
to account for the $\xi_k$ updates as well, we would find that the complexity bound for Algorithm 
5.2 is essentially that of Algorithm 4.1 plus an $O(\log(1/\varepsilon))$ term (coming from $\log(\xi_0/\xi_\varepsilon)/\log\kappa_\Delta$) 
that accounts for the number of iterations needed to drive $\xi_k$ below $\xi_\varepsilon$. 

6 Conclusions 

We have proposed a general algorithmic framework with random models and a methodology for 
analyzing its complexity that relies on bounding the hitting time of a nondecreasing stochastic 
process that measures progress towards optimality. Our framework accounts for line-search and 
cubic regularization methods, for example, and we particularize our results to obtain precise 
complexity bounds in the case of nonconvex and convex functions. Despite allowing our models 
to be arbitrarily inaccurate sometimes, the bounds we obtained match their deterministic 
counterparts in the order of the accuracy $\varepsilon$. The effect of model inaccuracy is reflected by the 
constant multiple of the bound, which is a function of the probability that the model is 
sufficiently accurate. We have also briefly discussed ways to obtain probabilistically sufficiently 
accurate models as required by our framework. 

The results in the paper assume that the objective $f$ is deterministic. Obtaining global 
rates of convergence for similar algorithmic frameworks when $f$ is stochastic is a topic of 
future research. Also, further exploring ways to efficiently generate probabilistically sufficiently 
accurate models may increase the applicability of our results to a diverse set of problems. 